Wow, I'm staggered, thanks for sharing

I was under the impression that oftentimes top-of-the-line chips fail to come out perfectly to spec, and those with, say, a core that's a bit under spec, or that are missing a core entirely, get downclocked or otherwise cut back and sold as the next chip down the line.

Is that not a thing anymore? Or would a chip like this maybe be so specialized that you'd use, say, a generation-older transistor width and thus have more certainty of a successful yield?

Or does a chip this size just naturally land somewhere around 900,000 cores, and that's not always the exact count?

20 kW! Wow! 900,000 cores. 125 petaflops of compute. Very neat

Designing to tolerate defects is well-trodden territory. You just expect some rate of defects and have a way of disabling the failing blocks.

So you shoot for 10% more cores and disable failing cores?

More or less, yes. Of course, defects are not evenly distributed, so you get a lot of chips with different grades of brokenness. Normally the more broken chips get sold off as lower-tier products. A six-core CPU is probably an eight-core with two broken cores.
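
To make that concrete, here's a toy Monte Carlo of binning (not any real product's numbers; the per-core defect rate and bin cutoffs are made up): simulate a pile of dies, count working cores on each, and sort the die into the highest tier it still qualifies for.

    import random

    CORES_PER_DIE = 8
    P_CORE_DEFECT = 0.03   # assumed, illustrative per-core defect probability
    N_DIES = 100_000

    bins = {"8-core": 0, "6-core": 0, "scrap": 0}
    for _ in range(N_DIES):
        working = sum(random.random() > P_CORE_DEFECT for _ in range(CORES_PER_DIE))
        if working == 8:
            bins["8-core"] += 1   # fully intact: sold as the top SKU
        elif working >= 6:
            bins["6-core"] += 1   # fuse off the bad cores, sell a tier down
        else:
            bins["scrap"] += 1    # too many dead cores to ship

    for name, count in bins.items():
        print(f"{name}: {100 * count / N_DIES:.1f}%")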

Though in this case, it seems [1] that Cerebras just has so many small cores that they can expect a fairly consistent number of broken cores and route around them (rough sketch of the arithmetic below).

[1]: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...
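
A rough sketch of that "consistent level of broken cores" point (the defect rate here is invented, purely to show the statistics): with ~900,000 tiny cores, the dead-core count barely varies from wafer to wafer, so you can budget a fixed amount of spare capacity and remap logical cores onto whichever physical cores survived.

    import math, random

    N_CORES = 900_000
    P_DEFECT = 0.01   # assumed per-core defect probability, not Cerebras's real number

    mean = N_CORES * P_DEFECT
    std = math.sqrt(N_CORES * P_DEFECT * (1 - P_DEFECT))
    print(f"expected dead cores: {mean:.0f} +/- {std:.0f} (~{100 * std / mean:.1f}% spread)")

    # One simulated wafer: mark dead cores, then expose only the working ones
    # as contiguous logical IDs (a stand-in for routing around dead cores).
    dead = {i for i in range(N_CORES) if random.random() < P_DEFECT}
    logical_to_physical = [i for i in range(N_CORES) if i not in dead]
    print(f"this wafer: {len(dead)} dead, {len(logical_to_physical)} usable cores")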

Well, it's more like they have 900,000 cores on a WSE and disable whichever ones don't work.

Seriously, that's literally just what they do.

IIRC, a lot of design went into making it so that you can disable parts of this chip selectively.