I'm unsure of the size needed for the prompt cache, but I suspect it's several gigs (some fraction of the model weight size). How would the user upload this every time they resumed an old idle session? And are they going to save /every/ session you do this with?
They could let you nominate an S3 bucket (or Azure/GCP/etc equivalent). Instead of dropping data from the cache, they encrypt it and save it to the bucket; on a cache miss they check the bucket and try to reload from it. You pay for the bucket; you control the expiry time for it; if it costs too much you just turn it off.
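A minimal sketch of what that spillover path could look like, assuming boto3 and an invented key scheme (the bucket name, key layout, and serialization are all made up for illustration; this reflects no provider's actual implementation):

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "my-kv-cache-spill"  # hypothetical user-nominated bucket

s3 = boto3.client("s3")

def evict_to_bucket(session_id: str, kv_cache_blob: bytes) -> None:
    """On eviction, spill the serialized KV cache to the user's bucket.

    Server-side encryption here; a real system would also encrypt
    client-side so the storage provider never sees plaintext.
    Expiry is just an S3 lifecycle rule the user configures.
    """
    s3.put_object(
        Bucket=BUCKET,
        Key=f"kv-cache/{session_id}",
        Body=kv_cache_blob,
        ServerSideEncryption="aws:kms",
    )

def load_on_miss(session_id: str) -> bytes | None:
    """On a cache miss, try the bucket before falling back to a
    full re-prefill of the conversation from its tokens."""
    try:
        resp = s3.get_object(Bucket=BUCKET, Key=f"kv-cache/{session_id}")
        return resp["Body"].read()
    except ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchKey":
            return None  # expired or never spilled: re-prefill instead
        raise
```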
A few gigs of disk is not that expensive. Imo they should allocate every paying user (at least) one disk cache slot that never expires. Use it for their most recent long chat; a very short question-answer that could easily be replayed shouldn't evict a long convo.
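One way to express that eviction rule, as a sketch: weight each cached conversation by how expensive it would be to replay (roughly its token count) and only let a newcomer claim the slot if it would cost more to rebuild. The names and the cost model are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    session_id: str
    prompt_tokens: int  # tokens that must be re-prefilled on replay

def replay_cost(entry: CacheEntry) -> int:
    # Rough cost model: prefill work scales with token count,
    # so a long conversation is far more expensive to rebuild.
    return entry.prompt_tokens

def should_evict(slot: CacheEntry, candidate: CacheEntry) -> bool:
    """Replace the slot's occupant only if the newcomer is costlier
    to replay, so a short Q&A never bumps a long convo."""
    return replay_cost(candidate) > replay_cost(slot)
```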
I don't know how large the cache is, but Gemini guessed that the quantized KV cache for Gemini 2.5 Pro / Claude 4 at 1M context could be 78 gigabytes. ChatGPT guessed even bigger numbers. If someone can offer a more precise estimate, you're welcome to :-).
So it would probably be quite a long transfer to perform in these cases, and probably not very feasible to implement at scale.
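For a back-of-the-envelope check: the KV cache grows as 2 (keys and values) × layers × KV heads × head dim × tokens × bytes per value. The model dimensions below are pure guesses, since neither Gemini's nor Claude's architecture is public, but plausible numbers land in the same tens-of-gigabytes ballpark as the estimates above:

```python
# All model dimensions are guesses; frontier-model architectures are not public.
layers        = 60         # transformer layers (assumed)
kv_heads      = 8          # grouped-query attention KV heads (assumed)
head_dim      = 128        # per-head dimension (assumed)
tokens        = 1_000_000  # 1M-token context
bytes_per_val = 1          # fp8-quantized cache values (assumed)

# Keys and values are both cached, hence the factor of 2.
cache_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_val
print(f"KV cache: ~{cache_bytes / 1e9:.0f} GB")  # ~123 GB

# Time to move that over a fast 10 Gbps link:
link_gbps = 10
seconds = cache_bytes * 8 / (link_gbps * 1e9)
print(f"Transfer at {link_gbps} Gbps: ~{seconds:.0f} s")  # ~98 s
```

So even on a datacenter-grade link, reloading a cold session's cache would add on the order of a minute of transfer time, which supports the feasibility concern.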
What's lost in this thread is that these caches are in very tight supply: they live literally on the GPUs running inference. The GPUs must process (prefill) all the tokens in the conversation, which is expensive, and continuing the conversation can then leverage the GPU-resident cache to avoid re-processing the full context up to that point. But GPUs are scarce, so if a thread has been dead for a while, they need to reuse that GPU memory for other customers.
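A sketch of the reuse being described, with an invented cache API: on a continuation, only the tokens past the still-resident prefix need the expensive prefill pass; once the cache is evicted, the whole conversation is re-prefilled from scratch.

```python
def prefill(tokens: list[int]) -> None:
    # Stand-in for the expensive forward pass that builds KV states.
    print(f"prefilling {len(tokens)} tokens")

def continue_conversation(session_tokens: list[int],
                          gpu_cache: dict[str, list[int]],
                          session_id: str) -> None:
    """Invented illustration: gpu_cache maps a session to the token
    prefix whose KV states are still resident on the GPU."""
    cached = gpu_cache.get(session_id, [])

    # Reuse holds only while the cached tokens are a prefix of the request.
    common = 0
    while (common < len(cached) and common < len(session_tokens)
           and cached[common] == session_tokens[common]):
        common += 1

    # Cache hit: prefill only the new suffix.
    # Cache evicted/cold: common == 0, so everything is recomputed.
    prefill(session_tokens[common:])
    gpu_cache[session_id] = session_tokens
```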