their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

I've "adjusted" my workflows now to use the cache. (Basically read all the files in your project very early on in your session, etc., simple stuff like that.)

Nearly all requests are cached now. It's amazing.

[dead]