Their struggle with the Nvidia driver bugs they had to work around was very relatable. You'd think that if you buy 10,752 of their high-end GPUs you'd get some support with them.

Agreed, but the problem seems to be even worse with AMD from what I hear, or at least it was when I checked with some of my HPC buddies a little over a year ago. Constant driver bugs and crickets from upstream "support".

no, you have to pay the yearly per-GPU license for that.

did I miss a blog on this?

we haven't had time to write one yet, but there is the tech report, which already has a lot of details

The report is packed with interesting details. The engineering challenges and solutions chapter especially shows how things that are supposed and expected to work break when run at massive scale. Really difficult bugs. Great writeup.

thank you!