Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state as it stands immediately after the N context messages (i.e., the KV cache) to disk, then reload it later without re-running inference on the context tokens themselves. Even if every customer did this for ~3 conversations per user, you'd still only need a small fraction of a typical datacenter to house the drives. The bottleneck then becomes architecture/topology and bus bandwidth, problems people have been contending with for decades, not inference time on GPUs.
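
For the curious, here's a minimal sketch of the idea with Hugging Face transformers (gpt2 as a stand-in model; the prompt strings and the conversation.kv.pt filename are made up, and a real serving stack would store paged KV blocks rather than torch.save blobs):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "gpt2"  # stand-in; any causal LM with KV caching behaves the same
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    # The N context messages, concatenated into one prompt (hypothetical).
    context = "System: ...\nUser: ...\nAssistant: ..."
    ctx_ids = tok(context, return_tensors="pt").input_ids

    # Pay the inference cost on the context exactly once, keeping the KV cache.
    with torch.no_grad():
        out = model(ctx_ids, use_cache=True)

    # Persist the per-layer key/value tensors: this is the "model state" for
    # the conversation. The legacy tuple format is plain tensors, so it
    # round-trips through torch.save/torch.load.
    kv = out.past_key_values
    if hasattr(kv, "to_legacy_cache"):
        kv = kv.to_legacy_cache()
    torch.save(kv, "conversation.kv.pt")

    # Later, possibly after a restart: reload and continue the conversation
    # without re-running inference on any of the context tokens.
    past = torch.load("conversation.kv.pt")
    try:
        from transformers import DynamicCache
        past = DynamicCache.from_legacy_cache(past)
    except ImportError:
        pass  # older transformers versions accept the raw tuple directly
    new_ids = tok("User: next question\n", return_tensors="pt").input_ids
    # The attention mask must cover the cached context plus the new tokens.
    mask = torch.ones(1, ctx_ids.shape[1] + new_ids.shape[1], dtype=torch.long)
    with torch.no_grad():
        out2 = model(new_ids, past_key_values=past,
                     attention_mask=mask, use_cache=True)

This is roughly what prefix caching in servers like vLLM already does in memory; the disk tier is the only new ingredient. Note the blob scales as roughly 2 x layers x kv_heads x head_dim x bytes-per-value per token, so long contexts do get bulky.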

This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving inference at scale.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200