The way coding agents work is fantastically wasteful. All those megabytes of code are processed over and over and over, sometimes within just one session.

There are papers describing KV cache precomputation for commonly used documents (e.g. KVLink), but, of course, it's not a priority for model providers: they'd rather sell you more tokens, and they'd rather get to AGI/ASI first than optimize usage of existing models...

Claude Code gets >98% KV cache hits. It’s not reprocessing unless you let the cache go cold (5 minutes, which is annoyingly short).

I meant caching at a bigger level. If you're an organization with 100 developers each doing 10 sessions a day, that's 1000 sessions, so you're paying 1000x for the tokens in frequently used documents even if you had 100% KV cache hits within each session. Apparently that's too costly even for companies with trillion dollar market cap...

Normally the KV cache works only if your context prefix is identical, but there are papers which demonstrate that documents can be cached across different contexts.
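To make the prefix point concrete, here is a minimal sketch of why ordinary prefix caching stops helping as soon as anything changes ahead of the document. The token lists and the `shared_prefix_len` helper are made up for illustration; real serving stacks match on token blocks or hashes, not Python lists.

```python
def shared_prefix_len(cached, new):
    """Count how many leading tokens of `new` match the cached context."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Hypothetical tokenized contexts: a system prompt, a shared document, a question.
doc = ["def", "foo", "(", ")", ":"]
session_a = ["sys"] + doc + ["question_a"]
session_b = ["sys"] + doc + ["question_b"]          # same prefix, new question
session_c = ["sys", "note"] + doc + ["question_a"]  # one extra token before the doc

reuse_b = shared_prefix_len(session_a, session_b)
reuse_c = shared_prefix_len(session_a, session_c)
print(reuse_b, reuse_c)  # prints "6 1"
```

In session_c the document is byte-for-byte the same, but only the first token is reusable because the prefix diverged before it. That gap is what the cross-context caching papers try to close.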

Ah, understood, and thanks for the clarification!

I believe OP is talking about new sessions or after compaction. He’s getting at the fact that LLMs are stateless and have to rediscover your codebase on every new session.

To be fair, on the Monday morning after a holiday, that’s exactly what I’m like too.

Are you sure that hitting the cache means you’re not paying for those tokens?

You do pay, but at 10% of the non-cached price (in quota or dollars). See https://platform.claude.com/docs/en/about-claude/pricing
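A rough back-of-the-envelope sketch of what that means for the 100-developers, 10-sessions-a-day scenario mentioned upthread. Only the 10% cached rate comes from this thread; the document size, turns per session, and price per million tokens are made-up assumptions, and any surcharge for writing to the cache is ignored.

```python
def daily_doc_cost(doc_tokens, sessions, turns_per_session,
                   price_per_mtok, cached_ratio=0.10):
    """Dollars spent per day re-reading one shared document.

    Assumes the first turn of every session pays full price and each
    later turn hits the cache at `cached_ratio` of the full price.
    """
    full_price_tokens = sessions * doc_tokens
    cached_tokens = sessions * (turns_per_session - 1) * doc_tokens
    billed = full_price_tokens + cached_tokens * cached_ratio
    return billed * price_per_mtok / 1_000_000

# All numbers below are hypothetical, chosen only to show the arithmetic.
sessions = 100 * 10  # 100 developers x 10 sessions a day
with_cache = daily_doc_cost(50_000, sessions, 20, price_per_mtok=3.0)
no_cache = daily_doc_cost(50_000, sessions, 20, price_per_mtok=3.0,
                          cached_ratio=1.0)
print(f"with cache: ${with_cache:,.0f}/day, without: ${no_cache:,.0f}/day")
```

Under these assumptions the in-session cache cuts the bill from roughly $3,000 to roughly $435 a day, but every one of the 1000 sessions still pays full price for its first read of the document.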

Thanks, I should have checked. Their pricing table is pretty clear; I was just lazy.
