> "Downside is this chip would be huuuuge - a whole wafer."
Why don't we have chips like that? If a CPU the size of a postage stamp can deliver x amount of performance, imagine how much you could get from an entire wafer of chips running in parallel. Obviously it would only suit certain use cases (you couldn't fit an entire wafer in a phone), but still.
Using an entire wafer for one chip would result in extremely low manufacturing yields. Even in state-of-the-art cleanrooms, some parts of every wafer will still come out with defects, and the bigger the chip, the more likely it is that at least one defect lands on it.
With CPUs and GPUs, chip makers can disable faulty cores and bin them as lower SKUs to get some yield out of it. But if you're using an entire wafer to embed weights, and a speck of dust causes a printing defect that makes the weights wrong, the entire wafer is worthless.
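To put rough numbers on the yield argument, here is a minimal sketch using the simple Poisson yield model, Y = exp(-A * D0), where A is the chip area and D0 is the defect density. The defect density (0.1 defects/cm²), the 1 cm² "postage stamp" die, and the 300 mm wafer are assumptions picked for illustration, not figures from any real fab, and the wafer area ignores edge exclusion.

```python
import math

def poisson_yield(area_cm2: float, defect_density_per_cm2: float) -> float:
    """Probability that a chip of the given area has zero defects (Poisson model)."""
    return math.exp(-area_cm2 * defect_density_per_cm2)

# Assumed values, for illustration only.
D0 = 0.1                         # defects per cm^2 (hypothetical)
small_die_cm2 = 1.0              # roughly postage-stamp-sized die (hypothetical)
wafer_cm2 = math.pi * 15.0 ** 2  # full area of a 300 mm wafer, about 707 cm^2

small_yield = poisson_yield(small_die_cm2, D0)
wafer_yield = poisson_yield(wafer_cm2, D0)

print(f"small die yield:   {small_yield:.3f}")
print(f"whole-wafer yield: {wafer_yield:.3e}")
```

Under these assumptions the small die comes out defect-free about 90% of the time, while a single chip that needs the whole wafer to be perfect essentially never does.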
Do failed wafers have to go in the trash, or can you recycle them?
You can grind some of the raw silicon out of a finished wafer, but I don't think it'd be suitable to use in another batch of product. So instead of having the weights on the wafer like OOP was suggesting, inference hardware has been trending toward a wafer of lots and lots of smaller cores with fast SRAM.
What's the difference between disabling faulty cores and disabling the parts of the wafer that have defects?
I'm not an expert, but I think those are the same thing. But for an LLM etched onto a whole wafer, it doesn't make sense to disable part of it since that would remove some weights entirely.
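A toy model of why disabling bad cores rescues a wafer full of small interchangeable cores, but not a wafer where every region holds unique weights: treat each core as independently good with probability p, and call the wafer usable if no more than some number of cores are bad. The core count, per-core yield, and spare budget below are hypothetical, and this is not a description of how any real vendor handles redundancy.

```python
import math

def prob_usable(n_cores: int, p_good: float, max_bad: int) -> float:
    """Probability that at most max_bad of n_cores are defective (binomial tail).

    Assumes 0 < p_good < 1. Terms are computed in log space so that large
    core counts do not overflow or underflow.
    """
    log_p, log_q = math.log(p_good), math.log(1.0 - p_good)
    total = 0.0
    for bad in range(min(max_bad, n_cores) + 1):
        good = n_cores - bad
        log_term = (math.lgamma(n_cores + 1) - math.lgamma(bad + 1)
                    - math.lgamma(good + 1) + good * log_p + bad * log_q)
        total += math.exp(log_term)
    return total

# Hypothetical numbers: 700 cores, each good with probability exp(-0.1).
n_cores = 700
p_good = math.exp(-0.1)

no_spares = prob_usable(n_cores, p_good, max_bad=0)      # every core must work
with_spares = prob_usable(n_cores, p_good, max_bad=100)  # up to 100 may be disabled

print(f"usable with no bad cores allowed: {no_spares:.3e}")
print(f"usable with 100 spare cores:      {with_spares:.6f}")
```

With no tolerance for bad cores, this matches the "one defect kills the wafer" case, which is what etched-in weights would amount to. Allowing a budget of disabled cores pushes the usable fraction close to 1 under these assumptions.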
Is that defect easy to detect?
We do. The Cerebras line of Wafer Scale Engines is exactly an entire wafer of cores running in parallel with fast memory next to each one. It's intended for very high throughput LLM inference. https://www.cerebras.ai/chip