Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?
Oh, nothing official. There are people who estimate the sizes based on tok/s, cost, benchmarks, etc. The one most people go on is https://lifearchitect.substack.com/p/the-memo-special-editio.... He estimated Claude 3 Opus to be a ~2T-param model (given the pricing + speed). Opus 4 is 1.2T params according to him (though then I don't understand why the price stayed the same). Sonnet is estimated by various people to be around 100B-200B params.
[1]: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJ...
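For reference, the back-of-envelope math behind these estimates looks roughly like the sketch below. Every number in it (GPU price, utilization, margin, the $15/M example price) is an assumption for illustration, not anything Anthropic has published:

```python
# Rough back-of-envelope: infer active params from API output-token pricing.
# Every constant below is an assumption for illustration, not Anthropic data.

GPU_COST_PER_HOUR = 2.0   # assumed $/hour for one H100-class GPU
GPU_FLOPS = 1.0e15        # assumed usable FLOP/s per GPU (with batching)
UTILIZATION = 0.4         # assumed fraction of peak FLOP/s actually achieved
GROSS_MARGIN = 0.75       # assumed share of the price that is NOT compute

def estimate_active_params(price_per_million_output_tokens: float) -> float:
    """Crude estimate of active parameters from output-token pricing.

    Assumes decoding costs ~2 * active_params FLOPs per token and that
    (1 - GROSS_MARGIN) of the price goes to GPU time.
    """
    compute_dollars_per_token = (
        price_per_million_output_tokens / 1e6 * (1 - GROSS_MARGIN)
    )
    gpu_seconds_per_token = compute_dollars_per_token / (GPU_COST_PER_HOUR / 3600)
    flops_per_token = gpu_seconds_per_token * GPU_FLOPS * UTILIZATION
    return flops_per_token / 2  # ~2 FLOPs per active param per decoded token

# e.g. a hypothetical model priced at $15 per million output tokens
print(f"{estimate_active_params(15.0):.2e} active params (very rough)")
```

Each factor can easily be off by 2-4x, so the result swings by an order of magnitude depending on what you assume about margins and utilization, which is part of why these estimates are so contentious.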
If you're using the API cost of the model to estimate its size, then you can't turn around and use that size estimate to estimate the inference cost.
tok/s cannot in any way be used to estimate parameters. It's a tradeoff made at inference time. You can adjust your batch size to serve 1 user at a huge tok/s or many users at a slow tok/s.
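To make that concrete, here's a minimal sketch of the tradeoff with made-up hardware numbers (bandwidth, FLOP/s, KV-cache size, and the 70B example model are all assumptions). Each decode step reads the weights once plus every user's KV cache, so batching shares the weight reads across users: aggregate throughput goes up while per-user tok/s goes down.

```python
# Minimal sketch of the batch-size tradeoff during decoding.
# All hardware numbers are illustrative assumptions, not any provider's real setup.

HBM_BANDWIDTH = 3.0e12    # assumed bytes/s of memory bandwidth
PEAK_FLOPS = 1.0e15       # assumed FLOP/s
BYTES_PER_PARAM = 2       # bf16 weights
KV_BYTES_PER_SEQ = 5.0e9  # assumed KV-cache bytes read per user per step

def decode_speed(active_params: float, batch_size: int) -> tuple[float, float]:
    """Return (per-user tok/s, aggregate tok/s) for a given batch size.

    Each decode step reads the weights once plus every user's KV cache
    (memory bound) and does ~2 FLOPs per active param per user (compute bound).
    """
    bytes_per_step = active_params * BYTES_PER_PARAM + batch_size * KV_BYTES_PER_SEQ
    steps_memory = HBM_BANDWIDTH / bytes_per_step
    steps_compute = PEAK_FLOPS / (2 * active_params * batch_size)
    per_user = min(steps_memory, steps_compute)
    return per_user, per_user * batch_size

model = 70e9  # a hypothetical 70B-active model
for bs in (1, 8, 64, 256):
    per_user, total = decode_speed(model, bs)
    print(f"batch={bs:>3}: ~{per_user:5.1f} tok/s per user, ~{total:7.1f} tok/s aggregate")
```

Same model, roughly a 10x spread in per-user speed depending purely on serving configuration, which is why tok/s alone tells you very little about parameter count.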
Not everyone uses MoE architectures. It's not outlandish at all...
There's no way Sonnet 4 or Opus 4 are dense models.
Citation needed
Common sense:
- The compute requirements would be massive compared to the rest of the industry (see the rough numbers after this list)
- Not a single large open source lab has trained anything over 32B dense in the recent past
- There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided a benefit over MoE, the info would've spread like wildfire.
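On the first point, a rough sense of scale using the common ~6 * N_active * D training-FLOPs approximation. The token count and the 2T-total / 200B-active MoE split are assumptions picked for illustration, not known figures for any Anthropic model:

```python
# Rough comparison of training compute: dense vs MoE with the same total params.
# Uses the common ~6 * N_active * D FLOPs approximation; token count is an assumption.

TRAIN_TOKENS = 15e12  # assumed training tokens

def train_flops(active_params: float, tokens: float = TRAIN_TOKENS) -> float:
    """Approximate training FLOPs: forward + backward ~ 6 FLOPs per active param per token."""
    return 6 * active_params * tokens

dense_2t = train_flops(2e12)       # hypothetical 2T dense model
moe_200b_active = train_flops(200e9)  # hypothetical 2T-total / 200B-active MoE

print(f"2T dense:             {dense_2t:.2e} FLOPs")
print(f"2T MoE (200B active): {moe_200b_active:.2e} FLOPs")
print(f"dense / MoE ratio:    {dense_2t / moe_200b_active:.0f}x")
```

A ~10x gap in training compute at that scale is exactly the kind of cost difference that makes a frontier-sized dense model hard to believe.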