Hacker News

How do they “nerf the models”?

Are they quietly compacting context to reduce KV cache usage before the actual compaction kicks in? Like there’s a slider for how much to compress it, and its setting is never revealed to us?

airstrike 5 hours ago

I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
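To make the quantization part concrete, here is a minimal sketch of symmetric per-tensor quantization on random toy weights. It is only an illustration of why fewer bits means more rounding error. The helper names (`quantize_symmetric`, `dequantize`, `mean_abs_error`) are made up for this example, and nothing here is known to match what any provider actually runs.

```python
import random

def quantize_symmetric(weights, bits):
    """Round weights to signed integer codes using one shared scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp so float round-off can never push a code outside [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

random.seed(0)
# Toy stand-in for a weight tensor; real model weights are not Gaussian noise.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.5f}")
print(f"4-bit mean abs error: {err4:.5f}")
```

The 4-bit round trip loses noticeably more precision than the 8-bit one, which is the trade being speculated about: less memory and bandwidth per weight in exchange for noisier weights.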

lwarfield 3 hours ago

There's also lowering the number of experts activated per token in MoE models.
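A rough sketch of what that knob looks like, assuming a standard top-k router that renormalizes gate weights over the selected experts. The experts here are hypothetical scalar functions and `moe_forward` is a made-up name; a real MoE layer works on vectors and is batched, so treat this as a toy.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k):
    """Route one token through only the top_k highest-scoring experts."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize gate weights over the chosen experts only (an assumption;
    # some routers normalize over all experts before selecting).
    gates = softmax([router_logits[i] for i in chosen])
    out = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return out, chosen

random.seed(0)
# Hypothetical toy experts: each one is just a scalar linear map.
slopes = [random.uniform(-1.0, 1.0) for _ in range(8)]
experts = [lambda x, a=a: a * x for a in slopes]
router_logits = [random.gauss(0.0, 1.0) for _ in range(8)]

full, used_full = moe_forward(1.0, router_logits, experts, top_k=4)
cheap, used_cheap = moe_forward(1.0, router_logits, experts, top_k=2)
print(f"top-4 output {full:.4f} using experts {used_full}")
print(f"top-2 output {cheap:.4f} using experts {used_cheap}")
```

Dropping `top_k` from 4 to 2 halves the expert calls per token, and the output generally shifts because the lower-ranked experts no longer contribute.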