This whole article is built on using DeepSeek R1 as a proxy, which is a huge premise that I don't think is correct. DeepSeek is much more efficient, and I don't think it's a valid way to estimate what OpenAI's and Anthropic's costs are.

https://www.wheresyoured.at/deep-impact/

Basically, DeepSeek is _very_ efficient at inference, and that was the whole reason why it shook the industry when it was released.

DeepSeek's inference efficiency comes from two things: MoE and MLA (multi-head latent attention). OpenAI was rumored to be using MoE around the GPT-4 era, i.e. a long time ago.

Given Gemini's efficiency with long context, I would bet their attention is very efficient too.
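
To illustrate why MoE helps at inference time, here is a minimal sketch of generic top-k expert routing. This is not DeepSeek's actual router, and the function names and the 8-expert, top-2 numbers are made up for the example. The point is only that each token runs through k experts out of many, so the expert compute per token is a fraction of the total expert parameters.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs; the weights are a softmax over
    the selected experts only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def active_fraction(num_experts, k):
    """Fraction of expert parameters touched per token (hypothetical helper)."""
    return k / num_experts

# Hypothetical layer: 8 experts, 2 active per token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.9], k=2)
fraction = active_fraction(num_experts=8, k=2)
```

With these made-up numbers only a quarter of the expert weights are used for any given token, which is where the per-token compute saving comes from.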

GPT OSS uses fp4, which DeepSeek doesn’t use yet btw.

So no, big labs aren’t behind DeepSeek in efficiency. Not by much at least.

The reason it shook the market, at least, was the claim that its training cost was $5 million.

That's what the buzz focused on, which is strange since we don't actually know what it cost them. Inference optimization, meanwhile, is a fact and is even more impactful, since training costs benefit from economies of scale.

I don't think that's strange at all. It's a much more palatable narrative for the masses, who don't know what inference and training are and who think having conversations = training.

I agree, nothing surprising in that. Also, back then inference wasn't questioned as much as it is today with regard to being sold at a loss.

Also the fact that it cost 10% of what other models cost. Pretty much still does.

Why would you think that deepseek is more efficient than gpt-5/Claude 4 though? There's been enough time to integrate the lessons from deepseek.

Because to make GPT-5 or Claude better than previous models, you need to do more reasoning which burns a lot more tokens. So, your per-token costs may drop, but you may also need a lot more tokens.
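
A quick sketch of that arithmetic, with entirely made-up prices and token counts (none of these numbers come from any provider): the cost of one answer is price per token times tokens generated, so a cheaper token can still mean a more expensive answer.

```python
def cost_per_answer(price_per_million_tokens, output_tokens):
    """Dollar cost of one answer: per-token price times tokens generated."""
    return price_per_million_tokens * output_tokens / 1_000_000

# Hypothetical older model: $10 per million tokens, short 500-token answers.
old_cost = cost_per_answer(price_per_million_tokens=10.0, output_tokens=500)

# Hypothetical reasoning model: half the per-token price, but it burns
# 4,000 tokens (reasoning plus answer) to respond.
new_cost = cost_per_answer(price_per_million_tokens=5.0, output_tokens=4000)

ratio = new_cost / old_cost  # how much more the newer answer costs
```

In this invented case the per-token price halves while the cost per answer quadruples.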

GPT-5 can be configured extensively. Is there any configuration of GPT-5 offering ~DeepSeek-level performance that is more expensive than DeepSeek per token?

The "efficiency" mentioned in the blog post you linked is the price difference between DeepSeek and o1. It doesn't mean that GPT-5 or other SOTA models are less efficient.

Uhhh, I'm pretty sure DeepSeek shook the industry because of a 14x reduction in training cost, not inference cost.

We also don't know the per-token cost for OpenAI and Anthropic models, but I would be highly surprised if it was significantly more expensive than open models anyone can use and run themselves. It's not like they aren't also investing in inference research.

DeepSeek was trained with distillation. Any accurate estimate of training costs should include the training costs of the model that it was distilling.

That makes the calculation nonsensical, because if you go there... you'd also have to include all the energy used in producing the content the other model providers used. So now suddenly it's everyone's devices on which they wrote comments on social media, pretty much all servers that have ever served a request to OpenAI's/Google's/Anthropic's bots, etc.

Seriously, that claim was always completely disingenuous

I don't think it's that nonsensical to realize that in order to have AI, you need generations of artists, journalists, scientists, and librarians to produce materials to learn from.

And when you're using an actual AI model to "train" (copy), it's not even a shred of nonsense to realize the prior model is a core component of the training.

Not just energy cost, but also licensing cost of all this content…

Isn't training cost a function of inference cost? From what I gathered, they reduced both.

I remember seeing lots of videos at the time explaining the details, but basically it came down to the kind of hardware-aware programming that used to be very common. (Although they took it to the next level by using undocumented behavior to their advantage.)

They're typically somewhat related, but the difference between training and inference can vary greatly, so I guess the answer is no.

They did reduce both though, mostly due to reduced precision.
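
For a sense of scale on the precision point, here is a back-of-the-envelope sketch of weight memory at different bit widths. The 100B parameter count is hypothetical, and the calculation ignores activations, KV cache, and the per-block scaling factors that low-precision formats carry.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Hypothetical 100-billion-parameter model stored at different precisions.
PARAMS = 100e9
footprint = {
    "16-bit": weight_memory_gb(PARAMS, 16),
    "8-bit": weight_memory_gb(PARAMS, 8),
    "4-bit": weight_memory_gb(PARAMS, 4),
}
```

Halving the bits per weight halves the memory to hold and move, which matters for both training and inference since both are often bound by memory bandwidth.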

Because of the alleged reduction in training costs.

All reports by companies are alleged until verified by other, more trustworthy sources. I don't think it's especially notable that it's alleged because it's DeepSeek vs. the alleged numbers from other companies.

What are we meant to take away from the 8000 word Zitron post?

In any case, here is what Anthropic CEO Dario Amodei said about DeepSeek:

"DeepSeek produced a model close to the performance of US models 7-10 months older, for a good deal less cost (but not anywhere near the ratios people have suggested)"

"DeepSeek-V3 is not a unique breakthrough or something that fundamentally changes the economics of LLM’s; it’s an expected point on an ongoing cost reduction curve. What’s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese."

https://www.darioamodei.com/post/on-deepseek-and-export-cont...

We certainly don't have to take his word for it, but the claim is that DeepSeek's models are not much more efficient to train or to run inference on than closed models of comparable quality. Furthermore, both Amodei and Sam Altman have recently claimed that inference is profitable:

Amodei: "If you consider each model to be a company, the model that was trained in 2023 was profitable. You paid $100 million, and then it made $200 million of revenue. There's some cost to inference with the model, but let's just assume, in this cartoonish cartoon example, that even if you add those two up, you're kind of in a good state. So, if every model was a company, the model, in this example, is actually profitable.

What's going on is that at the same time as you're reaping the benefits from one company, you're founding another company that's much more expensive and requires much more upfront R&D investment. And so the way that it's going to shake out is this will keep going up until the numbers go very large and the models can't get larger, and then it'll be a large, very profitable business, or, at some point, the models will stop getting better, right? The march to AGI will be halted for some reason, and then perhaps it'll be some overhang. So, there'll be a one-time, 'Oh man, we spent a lot of money and we didn't get anything for it.' And then the business returns to whatever scale it was at."

https://cheekypint.substack.com/p/a-cheeky-pint-with-anthrop...
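
The "each model is a company" framing can be sketched as two different ways of adding up the same money. The $100 million training cost and $200 million revenue are from the quote; the inference cost of zero follows his "cartoonish" simplification, and the $1,000 million cost of the next model is a hypothetical figure chosen only to show the effect.

```python
def per_model_profit(training_cost, revenue, inference_cost=0.0):
    """Profit of one model treated as its own company (units: $ millions)."""
    return revenue - training_cost - inference_cost

def company_cash_flow(current_revenue, current_inference_cost, next_training_cost):
    """Cash flow in a year where one model earns while the next is trained."""
    return current_revenue - current_inference_cost - next_training_cost

# Figures from the quote: $100M to train, $200M of revenue.
model_2023 = per_model_profit(training_cost=100, revenue=200)

# Hypothetical: the successor costs $1,000M to train in the same year
# that the 2023 model brings in its $200M.
company_year = company_cash_flow(current_revenue=200,
                                 current_inference_cost=0,
                                 next_training_cost=1000)
```

Under these assumptions the model is profitable on its own while the company as a whole is deep in the red for that year, which is the distinction both CEOs are drawing.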

Altman: "If we didn’t pay for training, we’d be a very profitable company."

https://www.theverge.com/command-line-newsletter/759897/sam-...

In terms of sources, I would trust Zitron a lot more than Altman or Amodei. To be charitable, those CEOs are known for their hyperbole and for saying whatever is convenient in the moment, but they certainly aren't that careful about being precise or about including inconvenient details. Which is what a CEO should do, more or less, but I wouldn't trust their word on most things.

I agree we should not take CEOs at their word, we have to think about whether what they're saying is more likely to be true than false given other things we know. But to trust Zitron on anything is ridiculous. He is not a source at all: he knows very little, does zero new reporting, and frequently contradicts himself in his frenzy to believe the bubble is about to pop any time now. A simple example: claiming both that "AI is very little of big tech revenue" and "Big tech has no other way to show growth other than AI hype". Both are very nearly direct quotes.

Those two statements are not contradictory, and thinking that they are betrays a pretty fundamental misunderstanding of his basic thesis.

The first statement is about the present value of AI. The second statement is about their belief in the future value of AI.

It is not about the present and future value of AI at all. It is about the present and future value of things other than AI. Here is the full quote:

"There is nothing else after generative AI. There are no other hypergrowth markets left in tech. SaaS companies are out of things to upsell. Google, Microsoft, Amazon and Meta do not have any other ways to continue showing growth, and when the market works that out, there will be hell to pay, hell that will reverberate through the valuations of, at the very least, every public software company, and many of the hardware ones too."

I am not doing some kind of sophisticated act of interpretation here. If AI is very little of big tech revenue, and big tech are posting massive record revenue and profits every quarter, then it cannot be the case that "there is nothing left after generative AI" and they “do not have any other ways to continue showing growth” — what is left is whatever is driving all that revenue and profit growth right now!

Grok 3.5: 400M training run. DeepSeek R1: 5M training run. Released around the same time, marginal performance difference.

I suspect that says more about Grok than anything else.

What a wrong take. It's not even MoE that was great in DeepSeek, it's shared experts + GRPO.