I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
There's also lowering the number of experts you run in MoE models.
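To make the MoE point concrete, here's a toy top-k routing sketch in plain Python. It is not how any particular provider does it. Real models route every token in every MoE layer over large weight matrices. `moe_forward`, the scalar "experts", and the router scores are all made up for illustration. The idea is that the router scores every expert, but only the `top_k` best ones are actually evaluated. Dropping `top_k` from 2 to 1 roughly halves the expert compute, and the output can change because fewer experts are mixed in.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def moe_forward(x, router_logits, experts, top_k):
    """Toy MoE layer (hypothetical): run only the top_k highest-scoring experts.

    x: scalar input (real models use token hidden-state vectors)
    router_logits: one router score per expert
    experts: list of callables, one per expert
    top_k: number of experts actually evaluated
    Returns (mixed output, number of experts evaluated).
    """
    # Rank experts by router score and keep the best top_k.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate weights over the chosen experts only.
    gates = softmax([router_logits[i] for i in chosen])
    # Only the chosen experts are called; the others cost nothing.
    out = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return out, len(chosen)


# Eight toy "experts": each just scales its input by a different weight.
experts = [lambda v, w=w: w * v for w in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)]
logits = [0.1, 2.0, 0.3, 1.5, -1.0, 0.0, 0.7, 0.2]

out_k2, n_k2 = moe_forward(1.0, logits, experts, top_k=2)
out_k1, n_k1 = moe_forward(1.0, logits, experts, top_k=1)
print(f"top_k=2: output={out_k2:.4f}, experts run={n_k2}")
print(f"top_k=1: output={out_k1:.4f}, experts run={n_k1}")
```

With `top_k=1` only the single highest-scoring expert runs, so the answer is cheaper to compute but no longer blends in the second expert's contribution.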