How do they “nerf the models”?
Are they quietly compacting context to reduce KV cache usage, before the actual compaction kicks in? Like there’s a slider for how aggressively to compress it, and that’s never revealed to us?
I suspect they quantize them, reduce thinking budgets, batch more requests, or all of the above.
There's also lowering the number of experts activated per token in MoE models.
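To make the MoE point concrete, here is a toy sketch of top-k expert routing in plain Python. Everything in it (`moe_forward`, the scalar "experts", the router logits) is made up for illustration and says nothing about how any provider actually serves models; it only shows that dropping `top_k` runs fewer experts per token and changes the output.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, top_k):
    """Route x to the top_k highest-scoring experts and mix their outputs."""
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(experts)),
                    key=lambda i: router_logits[i],
                    reverse=True)[:top_k]
    # Gate weights are renormalized over the chosen experts only.
    gates = softmax([router_logits[i] for i in chosen])
    out = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return out, chosen

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
router_logits = [2.0, 1.0, 0.5, -1.0]  # assumed router scores for one token

full, used_full = moe_forward(3.0, router_logits, experts, top_k=2)
cheap, used_cheap = moe_forward(3.0, router_logits, experts, top_k=1)
print(used_full, full)
print(used_cheap, cheap)
```

With `top_k=1` only the single best-scoring expert runs, so the compute per token is roughly halved in this toy setup, and the result differs from the `top_k=2` mix. Whether that difference is noticeable in a real model is exactly the open question.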