This seems very, very far off. From the latest reports, Anthropic has a gross margin of ~60%; it came out in their latest fundraising story. That one report from The Information estimated OpenAI's GM at ~50% including free users. These are gross margins, so any amortization or model training cost would likely come after that line.

Then, today almost every lab uses methods like speculative decoding and caching, which cut costs and speed things up significantly.
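For intuition on why speculative decoding helps, here is a toy greedy-acceptance sketch (not any lab's actual implementation; `draft_model.next_token` and `target_model.next_tokens` are hypothetical helpers, and real systems use a rejection-sampling acceptance rule): a cheap draft model proposes a few tokens, the big model verifies them in a single forward pass, and you only fall back to the big model's own token where they disagree.

    # Toy sketch of (greedy) speculative decoding; only shows the structure.
    def speculative_step(prefix, draft_model, target_model, k=4):
        # 1) Cheap draft model proposes k tokens autoregressively.
        draft, ctx = [], list(prefix)
        for _ in range(k):
            t = draft_model.next_token(ctx)          # hypothetical helper
            draft.append(t)
            ctx.append(t)

        # 2) Big model scores prefix + draft in ONE forward pass
        #    (that single pass is where the savings come from).
        verified = target_model.next_tokens(prefix, draft)  # hypothetical helper

        # 3) Accept the longest agreeing run, then take the big model's token.
        accepted = []
        for d, v in zip(draft, verified):
            if d == v:
                accepted.append(d)
            else:
                accepted.append(v)                   # big model's correction
                break
        return prefix + accepted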

The input numbers are far off. The assumption is 37B active parameters. Sonnet 4 is supposedly a 100B-200B-param model; Opus is supposedly about 2T params. Neither of them (even if we assume MoE) will have exactly that number of active params. Then there is a cost to hosting and activating params at inference time (the article kind of assumes a constant 37B active params throughout).
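To make the hosting-vs-activation point concrete, here's a toy calc (the sizes are just the community guesses above; bytes-per-weight and GPU memory are assumptions): total params set the memory footprint, i.e. how many GPUs you need just to hold the weights, while active params set the compute per token.

    # Toy footprint calc: 80 GB GPUs needed just to HOLD the weights,
    # ignoring KV cache, activations and redundancy. Sizes are guesses.
    BYTES_PER_PARAM = 2          # assume bf16/fp16 weights
    GPU_MEM_GB = 80              # assume 80 GB per GPU (H100-class)

    def gpus_to_hold(total_params):
        weight_gb = total_params * BYTES_PER_PARAM / 1e9
        return weight_gb / GPU_MEM_GB

    for name, total in [("37B", 37e9),
                        ("200B (Sonnet guess)", 200e9),
                        ("2T (Opus guess)", 2e12)]:
        print(f"{name:>20}: ~{gpus_to_hold(total):5.1f} GPUs just for weights")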

Gross margins also don't tell the whole story: we don't know how much Azure and Amazon charge for the infrastructure, and we have reason to believe they're selling it at a massive discount (Microsoft definitely does, as follows from its agreement with OpenAI). Microsoft gets the model, OpenAI gets discounted infra.

A discounted Azure H100 will still be more than $2 per hour; same goes for AWS. Trainium chips are new and not as effective (not saying they're bad), but they still cost in the same range.

For inference, gross margin is exactly (what the company charges the user per 1M tokens) minus (the direct cost to produce those 1M tokens, which is basically GPU cost).
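As a worked (entirely made-up) example of that formula, with an assumed $2/GPU-hour, an assumed aggregate throughput, and an assumed API price:

    # Toy gross-margin calc per 1M output tokens. Every number is an
    # assumption for illustration, not a disclosed provider figure.
    gpu_cost_per_hour = 2.00         # assumed $/GPU-hour
    tokens_per_sec_per_gpu = 100     # assumed aggregate (batched) throughput
    price_per_1m_tokens = 15.00      # assumed API price per 1M output tokens

    seconds_per_1m = 1_000_000 / tokens_per_sec_per_gpu
    gpu_cost_per_1m = gpu_cost_per_hour * seconds_per_1m / 3600

    margin = price_per_1m_tokens - gpu_cost_per_1m
    print(f"GPU cost per 1M tokens: ${gpu_cost_per_1m:.2f}")
    print(f"gross margin: ${margin:.2f} ({margin / price_per_1m_tokens:.0%})")

Note that halving the assumed GPU price roughly halves the cost line, which is why the discount question matters so much.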

I'm implying that what OpenAI pays per GPU-hour is much less than $2 because of the discount. That's an assumption. It could be $1, or $0.50, no?

It could still be burning money for Microsoft/Amazon.

Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?

Oh, nothing official. There are people who estimate the sizes based on tok/s, cost, benchmarks, etc. The one most go on is https://lifearchitect.substack.com/p/the-memo-special-editio.... That guy estimated Claude 3 Opus to be a 2T-param model (given the pricing + speed). Opus 4 is 1.2T params according to him (though then I don't understand why the price stayed the same). Sonnet is estimated by various people to be around 100B-200B params.

[1]: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJ...

If you're using the API cost of the model to estimate its size, then you can't turn around and use that size estimate to estimate the inference cost.

tok/s cannot in any way be used to estimate parameter count. It's a tradeoff made at inference time: you can adjust your batch size to serve one user at a huge tok/s or many users at a slow tok/s each.
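As a toy illustration (numbers made up): if you treat a saturated replica as having a roughly fixed aggregate throughput, per-user speed is just that aggregate divided by the batch size, so the tok/s you observe mostly reflects a serving decision, not model size.

    # Toy batching tradeoff: assume a saturated replica sustains a roughly
    # fixed aggregate throughput (a simplification). Numbers are made up.
    aggregate_tok_per_sec = 5000     # assumed saturated replica throughput

    for batch_size in (1, 8, 64, 256):
        per_user = aggregate_tok_per_sec / batch_size
        print(f"batch={batch_size:>3}: ~{per_user:7.1f} tok/s per user")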

Not everyone uses MoE architectures. It's not outlandish at all...

There's no way Sonnet 4 or Opus 4 are dense models.

Citation needed

Common sense:

- The compute requirements would be massive compared to the rest of the industry (see the rough per-token comparison after this list)

- Not a single large open source lab has trained anything over 32B dense in the recent past

- There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided a benefit over MoE, the info would've spread like wildfire.
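To put a rough number on the compute point (all sizes hypothetical, using the common 2 x active-params FLOPs-per-token rule of thumb): a 2T dense model has every parameter active, while a 2T-total MoE with, say, 40B active parameters does roughly 50x less work per generated token.

    # Rough per-token decode compute, dense vs MoE (hypothetical sizes).
    # Rule of thumb: FLOPs per generated token ~ 2 * active parameters.
    def tflops_per_token(active_params):
        return 2 * active_params / 1e12

    dense_2t = tflops_per_token(2e12)    # 2T dense: everything is active
    moe_2t   = tflops_per_token(40e9)    # 2T total, assumed 40B active

    print(f"2T dense:            ~{dense_2t:.2f} TFLOPs/token")
    print(f"2T MoE (40B active): ~{moe_2t:.3f} TFLOPs/token")
    print(f"ratio:               ~{dense_2t / moe_2t:.0f}x")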