Hacker News

I suspect they quantize the models, reduce thinking budgets, batch more requests, or all of the above.
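For anyone unfamiliar with the first item: quantizing stores weights at lower precision so each forward pass moves and multiplies fewer bytes. Below is a minimal sketch that assumes the simplest scheme, symmetric per-tensor int8. The function names are hypothetical, and this is not a claim about what any provider actually runs.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative sketch).

    Maps the range [-max|w|, max|w|] onto the integers [-127, 127]
    and returns the integer codes plus the single float scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp, in case float error pushes past 127.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
for w, a in zip(weights, approx):
    print(f"{w:+.4f} -> {a:+.4f}")
```

Each weight now takes 1 byte instead of 2 or 4, and the per-weight rounding error is at most half a quantization step, which is why output quality degrades only a little, if at all.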

There's also reducing the number of experts activated per token in MoE models.
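To make the MoE point concrete: a router scores every expert, but only the top-k actually run, so lowering k directly cuts expert compute per token. This is a toy scalar sketch assuming plain top-k gating with a softmax renormalized over the selected experts; real routers work on vectors and vary by model, and all names here (`route_top_k`, `moe_forward`, the toy experts) are hypothetical.

```python
import math


def route_top_k(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Weighted sum over the selected experts only; exactly k experts are called."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


calls = []  # records which experts ran, to show the compute saving


def make_expert(idx, scale):
    """Hypothetical toy expert: scales its input and logs that it was called."""
    def expert(x):
        calls.append(idx)
        return scale * x
    return expert


experts = [make_expert(i, float(i + 1)) for i in range(8)]
gate_logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.0, 1.9, -0.5]

full = moe_forward(1.0, experts, gate_logits, k=4)
print("k=4 ran", len(calls), "experts, output", round(full, 3))
calls.clear()
cheap = moe_forward(1.0, experts, gate_logits, k=2)
print("k=2 ran", len(calls), "experts, output", round(cheap, 3))
```

Halving k halves the expert work, but the output shifts slightly, because the dropped experts no longer contribute to the weighted sum.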