Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing
I'm fascinated by how the economy is catching up to demand for inference. The vast majority of today's capacity comes from silicon that merely happens to be good at inference, and it's clear that there's a lot of room for innovation when you design silicon for inference from the ground up.
With CapEx going crazy, I wonder where costs will stabilize and what OpEx will look like once these initial investments are paid back (or go bust). The common consensus seems to be that there will be a rug pull and frontier model inference costs will spike, but I'm not entirely convinced.
I suspect it largely comes down to how much more efficient custom silicon is compared to GPUs, as well as how accurately the supply chain is able to predict future demand relative to future efficiency gains. To me, it is not at all obvious what will happen. I don't see any reason why a rug pull is any more or less likely than today's supply chain over-estimating tomorrow's capacity needs, and creating a hardware (and maybe energy) surplus in 5-10 years.
Nvidia seems cooked.
Google is crushing them on inference. By TPUv9, they could be 4x more energy efficient and cheaper overall (even if Nvidia cuts their margins from 75% to 40%).
Cerebras will be substantially better for agentic workflows in terms of speed.
And if you don't care as much about speed and only cost and energy, Google will still crush Nvidia.
And Nvidia won't be cheaper for training new models either. The vast majority of chips will be used for inference by 2028 instead of training anyway.
Nvidia has no manufacturing reliability story. Anyone can buy TSMC's output.
Power is the bottleneck in the US (and everywhere besides China). By TPUv9 - Google is projected to be 4x more energy efficient. It's a no-brainer who you're going with starting with TPUv8 when Google lets you run on-prem.
These are GW scale data centers. You can't just build 4 large-scale nuclear power plants in a year in the US (or anywhere, even China). You can't just build 4 GW solar farms in a year in the US to power your less efficient data center. Maybe you could in China (if the economics were on your side, but they aren't). You sure as hell can't do it anywhere else (maybe India).
What am I missing? I don't understand how Nvidia could've been so far ahead and just let every part of the market slip away.
> let every part of the market slip away.
Which part of the market has slipped away, exactly? Everything you wrote is supposition and extrapolation. Nvidia has a chokehold on the entire market. All other players still exist in the small pockets that Nvidia doesn't have enough production capacity to serve. And their dev ecosystem is still so far ahead of anyone else. Which provider gets chosen to equip a 100k-chip data center goes far beyond raw chip performance.
If code is getting cheaper to write, CUDA alternatives and tooling shouldn't be far behind. I can't see Nvidia holding its position for much longer.
> Nvidia has a chokehold on the entire market.
You're obviously not looking at expected forward orders for 2026 and 2027.
I think most estimates have Nvidia at more or less stable share of CoWoS capacity (around 60%), which is ~doubling in '26.
> What am I missing?
Largest production capacity maybe?
Also, market demand will be so high that every player's chips will be sold out.
> Largest production capacity maybe?
Anyone can buy TSMC's output...
Which I'm sure is 100% reserved through at least 2030.
Aren't they building new fabs, though? Or even those are already booked?
Can anyone buy TSMC though?
No. TSMC will not take the risk on allocating capacity to just anyone given the opportunity cost.
Not without an army
Man I hope someone drinks Nvidia's milk shake. They need to get humbled back to the point where they're desperate to sell gpus to consumers again.
Only major road block is cuda...
What puzzles me is that AMD can't secure any meaningful size of AI market. They missed this train badly.
> What am I missing?
VRAM capacity given the Cerebras/Groq architecture compared to Nvidia.
In parallel, RAM contracts that Nvidia has negotiated well into the future that other manufacturers have been unable to secure.
I believe they licensed something from Groq.
Well they `acquired` groq for a reason.
It's "dinner-plate sized" because it's just a full silicon wafer. It's nice to see that wafer-scale integration is now being used for real work but it's been researched for decades.
If history has taught us anything, “engineered systems” (like mainframes & hyper converged infrastructure) emerge at the start of a new computing paradigm … but long-term, commodity compute wins the game.
I think that was true when you could rely on good old Moore’s law to make the heavy iron quickly obsolete but I also think those days are coming to an end
Technically, Cerebras' solution is really cool. However, I am skeptical that it will be economically useful for larger models, since the number of racks required scales with the size of the model in order to fit the weights in SRAM.
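A hypothetical back-of-envelope of that concern below; the ~44 GB of on-chip SRAM per wafer and the one-wafer-per-system assumption are my own rough numbers for illustration, not figures from this thread.

```python
# Rough sketch: wafers needed just to hold a model's weights in SRAM.
# SRAM-per-wafer and bytes-per-param are assumptions, not vendor specs.
import math

def wafers_needed(params_billion: float,
                  bytes_per_param: float = 2.0,     # FP16 weights
                  sram_gb_per_wafer: float = 44.0) -> int:
    """Wafers required to fit the weights entirely in on-chip SRAM."""
    weight_gb = params_billion * bytes_per_param    # 1e9 params * bytes ~= GB
    return math.ceil(weight_gb / sram_gb_per_wafer)

for size_b in (70, 400, 1000):
    print(f"{size_b:5}B params @ FP16 -> ~{wafers_needed(size_b)} wafers")
```

Even under generous assumptions, weight storage alone runs into dozens of wafer-scale systems for frontier-sized models.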
Just wish they weren't so insanely expensive...
The bigger the chip, the worse the yield.
Cerebras has effectively 100% yield on these chips. They have an internal structure made by just repeating the same small modular units over and over again. This means they can just fuse off the broken bits without affecting overall function. It's not like it is with a CPU.
I suggest reading their website; they explain pretty well how they manage good yield. Though I'm not an expert in this field, it does make sense, and I would be surprised if they were caught lying.
This comment doesn't make sense.
One wafer will turn into multiple chips.
Defects are best measured on a per-wafer basis, not per-chip. So if your chips are huge and you can only put 4 chips on a wafer, 1 defect can cut your yield by 25%. If they're smaller and you fit 100 chips on a wafer, then 1 defect on the wafer is only cutting yield by 1%. Of course, there's more to this when you start reading about "binning", fusing off cores, etc.
There's plenty of information out there about how CPU manufacturing works, why defects happen, and how they're handled. Suffice to say, the comment makes perfect sense.
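A toy illustration of that scaling, using a simple Poisson yield model; the defect density and die areas below are made-up round numbers, not real fab data.

```python
# Toy Poisson yield model: probability that a single die is defect-free.
# Defect density and die areas are illustrative guesses only.
import math

def defect_free_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    return math.exp(-die_area_cm2 * defects_per_cm2)

D0 = 0.1  # assumed defects per cm^2
for name, area in [("small die", 1.0),
                   ("big GPU-class die", 8.0),
                   ("whole wafer as one die", 462.0)]:
    print(f"{name:24s} {area:7.1f} cm^2 -> {defect_free_yield(area, D0):6.1%}")
```

At wafer scale the defect-free probability is effectively zero, which is exactly why the defect tolerance discussed elsewhere in this thread matters.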
That's why you typically fuse off defective sub-units and just have a slightly slower chip. GPU and CPU manufacturers have done this for at least 15 years now, that I'm aware of.
Sure it does. If it’s many small dies on a wafer, then imperfections don’t ruin the entire batch; you just bin those components. If the entire wafer is a single die, you have much less tolerance for errors.
Although, IIUC, Cerebras expects some amount of imperfection and can adjust the hardware (or maybe the software) to avoid those components after they're detected. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...
You can just do dynamic binning.
Bigger chip = more surface area = higher chance for somewhere in the chip to have a manufacturing defect
Yields on silicon are great, but not perfect
Does that mean smaller chips are made from smaller wafers?
They can be made from large wafers. A defect typically breaks whatever chip it's on, so one defect on a large wafer filled with many small chips will still just break one chip of the many on the wafer. If your chips are bigger, one defect still takes out a chip, but now you've lost more of the wafer area because the chip is bigger. So you get a super-linear scaling of loss from defects as the chips get bigger.
With careful design, you can tolerate some defects. A multi-core CPU might have the ability to disable a core that's affected by a defect, and then it can be sold as a different SKU with a lower core count. Cerebras uses an extreme version of this, where the wafer is divided up into about a million cores, and a routing system that can bypass defective cores.
They have a nice article about it here: https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...
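A rough sketch of why routing around tiny cores helps, under the same kind of assumed defect density as above (all numbers illustrative, not Cerebras specs):

```python
# With ~1M tiny cores, the same defect density only knocks out a handful
# of cores instead of the whole die. All numbers are illustrative.
import math

DEFECTS_PER_CM2 = 0.1     # assumed defect density
WAFER_AREA_CM2 = 462.0    # roughly wafer-scale
N_CORES = 900_000         # "about a million" tiny cores

core_area = WAFER_AREA_CM2 / N_CORES
p_bad = 1.0 - math.exp(-DEFECTS_PER_CM2 * core_area)   # per-core defect prob.
print(f"expected defective cores: ~{N_CORES * p_bad:.0f} of {N_CORES}")
print(f"usable compute after routing around them: ~{(1 - p_bad) * 100:.3f}%")
```

Losing a few dozen cores out of nearly a million is noise, whereas losing a whole monolithic die to one defect is not.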
Nope. They use the same size wafers and then just put more chips on a wafer.
So, does a wafer with a huge chip have more defects per area than a wafer with 100s of small chips?
There's an expected number of defects per wafer. If a chip has a defect, then it is lost (simplification). A wafer with 100 chips may lose 10 to defects, giving a yield of 90%. The same wafer but with 1000 smaller chips would still have lost only 10 of them, giving 99% yield.
As another comment in this thread states, Cerebras seems to have solved this by building their big chip out of a lot of much smaller cores that can be disabled if they have defects.
Indeed, the original comment you replied to actually made no sense in this case. But there seemed to be some confusion in the thread, so I tried to clear that up. I hope I’ll get to talk with one of the cerebras engineers one day, that chip is really one of a kind.
You say this with such confidence and then ask if smaller chips require smaller wafers.
Not for what they are using it for. It is $1m+/chip and they can fit 1 of them in a rack. Rack space in DCs is a premium asset. The density isn't there. AI models need tons of memory (this product announcement is a case in point) and they don't have it, nor do they have a way to get it since they are last in line at the fabs.
Their only chance is an acqui-hire, but Nvidia just spent $20b on Groq instead. Dead man walking.
The real question is what’s their perf/dollar vs nvidia?
That's coupling two different usecases.
Many coding usecases care about tokens/second, not tokens/dollar.
I guess it depends what you mean by "perf". If you optimize everything for the absolutely lowest latency given your power budget, your throughput is going to suck - and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained, latency is a distraction. So TPU-like custom chips are likely the better choice.
By perf I mean how much does it cost to serve 1T model to 1M users at 50 tokens/sec.
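For anyone who wants to plug in their own numbers, here's a hedged sketch of that arithmetic; the concurrency factor and per-token prices are pure assumptions, not anyone's published pricing.

```python
# Back-of-envelope serving cost; concurrency and prices are assumptions.
users = 1_000_000
tokens_per_active_user_per_s = 50
concurrency = 0.05                      # assume ~5% of users generating at once

tok_per_s = users * concurrency * tokens_per_active_user_per_s
tok_per_month = tok_per_s * 86_400 * 30

for usd_per_mtok in (0.5, 2.0, 10.0):   # assumed blended $/1M output tokens
    cost = tok_per_month / 1e6 * usd_per_mtok
    print(f"${usd_per_mtok:>4}/Mtok -> ~${cost:,.0f}/month")
```

The real comparison is which hardware can hit a given $/Mtok while still delivering the 50 tokens/sec per user, which is exactly where the throughput-vs-latency tradeoff above comes in.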
Not all 1T models are equal. E.g., how many active parameters? What's the native quantization? How long is the max context? Also, it's quite likely that some smaller models in common use are even sub-1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much and you can enjoy the lightning-fast speed.
Just pick some reasonable values. Also, keep in mind that this hardware must still be useful 3 years from now. What’s going to happen to cerebras in 3 years? What about nvidia? Which one is a safer bet?
On the other hand, competition is good - nvidia can’t have the whole pie forever.
> Just pick some reasonable values.
And that's the point - what's "reasonable" depends on the hardware and is far from fixed. Some users here are saying that this model is "blazing fast" but a bit weaker than expected, and one might've guessed as much.
> On the other hand, competition is good - nvidia can’t have the whole pie forever.
Sure, but arguably the closest thing to competition for nVidia is TPUs and future custom ASICs that will likely save a lot on energy used per model inference, while not focusing all that much on being super fast.
AMD
[dead]
> Throughput is ultimately what matters
I disagree. Yes it does matter, but because the popular interface is via chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.
You just have to change the "popular interface" to something else. Chat is OK for trivia or genuinely time-sensitive questions, everything else goes through via email or some sort of webmail-like interface where requests are submitted and replies come back asynchronously. (This is already how batch APIs work, but they only offer a 50% discount compared to interactive, which is not enough to really make a good case for them - especially not for agentic workloads.)
Or Google TPUs.
TPUs don't have enough memory either, but they have really great interconnects, so they can build a nice high density cluster.
Compare the photos of a Cerebras deployment to a TPU deployment.
https://www.nextplatform.com/wp-content/uploads/2023/07/cere...
https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...
The difference is striking.
Oh wow the cabling in the first link is really sloppy!
Exactly. They won't ever tell you. It is never published.
Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.
Power/cooling is the premium.
Can always build a bigger hall
Exactly my point. Their architecture requires someone to invest the capex / opex to also build another hall.
Oh don't worry. Ever since the power issue started developing rack space is no longer at a premium. Or at least, it's no longer the limiting factor. Power is.
The dirty secret is that there is plenty of power. But, it isn't all in one place and it is often stranded in DC's that can't do the density needed for AI compute.
Training models needs everything in one DC, inference doesn't.
Yet investors keep backing NVIDIA.
At this point Tech investment and analysis is so divorced from any kind of reality that it's more akin to lemmings on the cliff than careful analysis of fundamentals
yep
Cerebras is a bit of a stunt like "datacenters in spaaaaace".
Terrible yield: one defect can ruin a whole wafer instead of just a chip region. Poor perf./cost (see above). Difficult to program. Little space for RAM.
They claim the opposite, though, saying the chip is designed to tolerate many defects and work around them.