How do they “nerf the models”?

Are they quietly compacting context to reduce KV cache usage, before the actual compaction? As if there's a slider for how aggressively to compress it, and that's never revealed to us?
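For a sense of why context length would be a tempting knob, here is a rough back-of-the-envelope sketch of KV cache size per sequence. The model dimensions are made up for illustration (not any particular provider's model), and this assumes plain attention with grouped KV heads and fp16 cache entries.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values at every layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=100_000)
half = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=50_000)

print(f"100k tokens: {full / 1e9:.1f} GB")
print(f" 50k tokens: {half / 1e9:.1f} GB")
```

The cache grows linearly with the number of tokens kept, so under these assumptions trimming the context in half frees half of that memory for other requests.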

I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
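To make the quantization part concrete, here is a minimal sketch of symmetric int8 quantization on a toy weight list. This is only an illustration of the general idea, not how any provider actually serves models; the function names and the weights are invented for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> int8 codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # toy values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case rounding error is about half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, max_error)
```

Each weight takes a quarter of the space of fp32 (half of fp16), at the cost of a small rounding error per weight. Whether that error shows up as a noticeable quality drop depends on the model and the scheme used.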

There's also lowering the number of experts activated per token in MoE models.
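A toy sketch of what that would look like, assuming standard top-k routing with a softmax over the selected experts. The "experts" here are just scalar functions and the router logits are invented; a real MoE layer routes each token through learned feed-forward blocks.

```python
import math

def moe_output(x, router_logits, experts, k):
    """Route x through the k highest-scoring experts and mix their outputs."""
    # Indices of the k experts with the largest router logits.
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so the gates sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]
    y = sum(g * experts[i](x) for g, i in zip(gates, top))
    return y, top

# Four toy experts and made-up router logits for a single token.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
router_logits = [2.0, 1.0, 0.1, 1.5]

y_k2, used_k2 = moe_output(3.0, router_logits, experts, k=2)
y_k1, used_k1 = moe_output(3.0, router_logits, experts, k=1)
print(used_k2, y_k2)
print(used_k1, y_k1)
```

Dropping from k=2 to k=1 halves the expert compute for that token, but the output changes because the second expert's contribution is gone. That is the kind of tradeoff this trick would make.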