Hacker News

cold_harbor 4 hours ago [ - ]

their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

zozbot234 4 hours ago [ - ]

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

vitorsr 2 hours ago [ - ]

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

trollbridge an hour ago [ - ]

I've "adjusted" my workflows now to use the cache. (Basically read all the files in your project very early on in your session, etc., simple stuff like that.)

Nearly all requests are cached now. It's amazing.

hmaddipatla 4 hours ago [ - ]

[dead]