Maybe I'm being dumb, but I don't understand what the innovation is here.
I get that they're using liquid coolant at higher than usual temperatures, but why couldn't they do that before? Most of the comparison in the article is for air cooled datacenters but what about other liquid cooled ones?
Surely in all the previous datacenters that have been designed there has been someone doing the math and determining what temperature things need to run at, how much energy it will use, how much heat it all will produce, etc.
edit: just saw this:
>Previous liquid-cooled servers were hybrid: GPUs and CPUs got cold plates, but the rest of the system stayed air-cooled, with finned heat sinks designed to shed heat into moving air. In a fully liquid-cooled server, the cooling for these components needed to be completely redesigned to use liquid.
The "innovation" is that everything is now attached to a watercooled block.
The rest is marketing: Cray supercomputers were liquid cooled back in the 1980s, with an inert liquid flowing across the entire board.
When my grandpa retired from Monsanto chemical back in the 90s, I helped him clean out his office and got a tour of a bunch of stuff.
He showed me their Cray, which had its own dedicated computer room, and they set it up with the coolant pump and fountain unit right in the middle in front of a glass wall facing the hallway so everyone could gawk at it.
The innovation is being able to run the chips at higher temps without ruining them too quickly.
Haven't AMD CPUs been targeting a 95°C limit for 5+ years already? I'd have guessed servers could run at 60°C without degrading much before more power-efficient hardware becomes available to switch to.
95°C is the core temp, not ambient. My parent comment was probably wrong though, see https://news.ycombinator.com/item?id=48667527
Mineral oil is the usual substance used, as it's clear and odorless.
Are pretty much all surface mounted components fine with this?
I wanted to waterproof a micro devboard by submerging it in mineral oil. I was worried the board might delaminate or the components would turn to goop.
> everything is now attached to a watercooled block
Does it increase the manufacturing and operational cost of such racks?
My partner lamented the same thing... Cray was doing this 40+ years ago
Cray used Fluorinert, a fluorocarbon fluid. So not exactly an environmentally friendly solution.
That shouldn’t be a problem then given that we don’t care about environmental impacts anymore.
Poor-quality water clogging the pipes integrated into the PCBs (which meant the PCBs had to be replaced) was said to be what killed the few Soviet Elbrus supercomputer installations.
> Surely in all the previous datacenters that have been designed there has been someone doing the math and determining what temperature things need to run at, how much energy it will use, how much heat it all will produce, etc.
It seemed like a pretty big deal ~ 2011 when big companies were running their (air cooled) datacenters closer to 95F (35C) vs the traditional 72F (22C). So jumping up a little more is maybe not super exciting, but it's still innovation.
And I think the answer to the "doing the math" question is, until you've actually collected the data, "what math?" Until someone actually puts a bunch of six-figure value hardware through its paces, pushes the previous limits, and sees what that does to its lifespan, there's nothing to meaningfully calculate.
And the fact that their system doesn't dump water. I think that is actually perhaps the bigger deal. Datacenters have been getting a lot of heat (pun intended) for using significant fresh water at the expense of local municipalities.
Closed-loop water cooling of chips is nothing new. There are two separate water systems that often get conflated*. The loop warms up the water, which is recycled but first needs to be cooled externally somehow. Normally that is done with evaporative cooling towers, which do use water, or chillers, which don't use water but use more energy. But they're claiming they can get that water loop so much hotter than the outdoor environment that active cooling isn't needed. They attribute this to improving the chip-to-water interaction.
Even air-cooled datacenters work somewhat the same way, but instead of water to chips, it's air. The air goes into hot aisles then exchanges heat with water, after which, see above.
* Other datacenter marketing materials talk about how they have a "closed loop system that uses no water" and they do still use water in the evap towers. I was half expecting this article to be that again, glad it wasn't.
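To make the "loop hotter than outdoors means no active cooling" point concrete, here's a minimal sketch. It assumes a dry cooler can reject heat passively whenever the loop water is warmer than the outdoor air by some minimum margin. The function name `cooling_mode`, the 10°C margin, and all temperatures below are hypothetical illustration values, not figures from the article.

```python
def cooling_mode(loop_return_c: float, outdoor_c: float, approach_c: float = 10.0) -> str:
    """Pick a heat-rejection mode for the facility water loop.

    approach_c is an assumed minimum temperature difference a dry cooler
    needs between the loop water and the outdoor air to shed heat without
    evaporating water or running a chiller.
    """
    if loop_return_c - outdoor_c >= approach_c:
        return "dry cooler"
    return "evaporative tower or chiller"


# Compare a cooler loop and a hotter loop on a mild day and a hot day.
for loop_c in (30.0, 50.0):
    for outdoor_c in (15.0, 38.0):
        mode = cooling_mode(loop_c, outdoor_c)
        print(f"loop {loop_c:.0f}C, outdoor {outdoor_c:.0f}C -> {mode}")
```

With these made-up numbers, the 30°C loop falls back to a tower or chiller on the 38°C day, while the 50°C loop can still use the dry cooler. A real plant would also depend on flow rates, heat exchanger sizing, and humidity, none of which this models.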
Just because it's not new doesn't mean that it was available or that the engineering needed to bring it to mass market wasn't significant.
It was available; there are plenty of water-cooled datacenters already, or water-cooled racks fitted into existing sites. Nvidia improved the cooling efficiency though.
You have to design your hardware to tolerate being run in consistently hotter conditions. There's a tradeoff between cooling cost and failure rate / capex.
Doesn't look like they made the hardware more tolerant of temperature, rather they made it remove waste heat more quickly.
"NVIDIA’s thermal engineering team reworked how those components handle heat, designing cooling loops that simplify how liquid is routed to multiple high-power chips on the board using a single inlet and outlet, resulting in a cleaner tray-level cooling architecture"
Nvidia's automotive and aerospace variants get ratings up to 85C, for comparison.
Don’t their consumer GPUs run at 85C core temp? Maybe not for as long though.
AMD CPUs basically all boost up to 90°C as a relatively normal operating temperature, as long as power (and some other factors) allows it. I assume AMD's and Nvidia's GPUs do too, but I play mostly CPU-bound games, so I see mine just sitting at ~60°C under load.
Core temp though. Ambient temp is a different story, and also depends on air vs water. In fact the article suggests the difference is getting the water more directly onto the chips, no mention of running at a higher core temp.
Temperature ratings are the allowed ambient temperature. The actual silicon will inevitably operate somewhat higher, because coolers are just moving heat down a temperature gradient.
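A rough way to put numbers on that gradient is a lumped steady-state model, where die temperature equals coolant (or ambient) temperature plus power times thermal resistance. This is only a sketch under that assumption. The function name `die_temperature_c` and the power and resistance values are made up for illustration, not specs for any real part.

```python
def die_temperature_c(coolant_c: float, power_w: float, r_th_c_per_w: float) -> float:
    """Steady-state die temperature for a lumped thermal-resistance model.

    T_die = T_coolant + P * R_th, where R_th (degrees C per watt) lumps
    together everything between the silicon and the cooling medium.
    """
    return coolant_c + power_w * r_th_c_per_w


# Hypothetical 300 W chip: an air cooler with higher thermal resistance
# versus a cold plate with lower resistance but warmer inlet water.
air_cooled = die_temperature_c(coolant_c=35.0, power_w=300.0, r_th_c_per_w=0.20)
liquid_cooled = die_temperature_c(coolant_c=45.0, power_w=300.0, r_th_c_per_w=0.10)

print(f"air cooled:    {air_cooled:.1f} C")
print(f"liquid cooled: {liquid_cooled:.1f} C")
```

Under these assumed values the liquid-cooled die ends up cooler even though its coolant is 10°C warmer, because the lower thermal resistance shrinks the gradient. That is the sense in which better chip-to-water contact allows a hotter loop without a hotter core.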
Speculating here: cooling CPUs and GPUs effectively with this technique at datacenter scale may never have been done before. Those things can run hot, easily crossing 100C. So the loop is doing a lot of work to keep them stable at 55C.
The innovation may be in the speed or volume flow of the coolant through different parts of the data centre to regulate the temperature. And of course, redesigning every component to be compatible with this fan-less design.
I think it’s only possible because NVIDIA is much more vertically integrated than ever before.
There's never been a reason a sealed water-cooled system had to use vast amounts of water. But the state of the art wound up being to use and expel the water. It seems like data centers operate like other industrial enterprises: locate in the city/county/state that gives you carte blanche, do whatever is convenient, and get used to the idea that this is the only way things can be done.
So a multitude of communities rebelling and complaining about environmental damage fell on deaf ears, but a technical spec might get paid attention to.
Is this not how it was already done? Huh.
AI slop from Nvidia, who would have thunk.