What people don't realize is that cache is *free*. Well, not literally free, but compared to the compute required to recompute it? Relatively free.
If you remove the cached-token cost from pricing, the overall API usage drops from around $5000 to $800 (or $200 per week) on the $200 max subscription. Still 4x cheaper than the API, but not losing them money either - if I had to guess it's break-even, as the compute is most likely going idle otherwise.
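The arithmetic in that claim can be checked directly. All figures below are the commenter's own estimates, not published rates:

```python
# Hypothetical numbers from the comment above: estimated API-equivalent cost
# of one month's usage on a $200/month subscription, with and without
# cached tokens billed at full price.
api_cost_with_cache_billing = 5000  # $/month if cached tokens billed normally
api_cost_without_cache = 800        # $/month if cache reads were free
subscription = 200                  # $/month plan price

print(api_cost_with_cache_billing / subscription)  # 25.0x cheaper vs raw API
print(api_cost_without_cache / subscription)       # 4.0x, matching the comment
```

So even if cache reads cost the provider nothing, the subscription would still be about 4x below API-equivalent pricing.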
Cache definitely isn't free! We're in a global RAM shortage and KV caches sit around consuming RAM in the hope that there will be a hit.
The gamble with caching is to hold a KV cache in the hope that the user will (a) submit a prompt that can use it and (b) that will get routed to the right server which (c) won't be so busy at the time it can't handle the request. KV caches aren't small so if you lose that bet you've lost money (basically, the opportunity cost of using that RAM for something else).
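"KV caches aren't small" is easy to quantify with a back-of-envelope estimate. The dimensions below are assumptions loosely modeled on a Llama-70B-class model with grouped-query attention, not any specific production model:

```python
# Rough KV cache size for one cached conversation.
# All model dimensions are assumed for illustration.
n_layers   = 80       # transformer layers
n_kv_heads = 8        # KV heads (grouped-query attention)
head_dim   = 128      # dimension per head
bytes_elem = 2        # fp16/bf16
seq_len    = 50_000   # a long agentic-coding-style context

# Factor of 2 for keys and values.
cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_elem
print(f"{cache_bytes / 1e9:.1f} GB")  # 16.4 GB for a single cached context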
Why do you believe that caches are held in RAM? They don’t need RAM performance, and disk is orders of magnitude cheaper.
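The RAM-vs-disk question comes down to how long restoring a cache takes. A rough sketch with assumed, order-of-magnitude bandwidth figures:

```python
# Time to restore a ~16 GB KV cache from different storage tiers.
# Bandwidth figures are rough assumptions, order-of-magnitude only.
cache_gb = 16
bandwidths_gb_s = {
    "DRAM":     100,   # host memory copy
    "NVMe SSD":   5,   # modern datacenter SSD
    "HDD":      0.2,
}
for tier, bw in bandwidths_gb_s.items():
    print(f"{tier}: {cache_gb / bw:.2f} s")
```

A few seconds from NVMe may be tolerable for cache restoration between turns, which is the argument for spilling to disk; sub-second DRAM access matters only if you need the cache hot for the very next token.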
> What people don't realize is that cache is free
I'm incredibly salty about this - they're intensely monetizing something that lets them sell their inference at premium prices to more users. Without any caching, they'd have much less capacity available.
> [...] if I had to guess it's break even as the compute is most likely going idle otherwise.
Why would it go idle? It would go to their next best use. At least they could help with model training or let their researchers run experiments etc.
Inference compute is vastly different from training compute; it also has to stay hot in VRAM, which probably takes up most of it. There's limited use for THAT much compute as well: they're running things like the Claude Code compiler, and even then they're scratching the surface of the amount of compute they have.
Training currently requires Nvidia's latest and greatest for the best models (they also use Google TPUs now, which are also technically the latest and greatest? However, those are more dual-purpose than anything, afaik, so that would be a correct assessment in that case)
Inference can run on a hot potato if you really put your mind to it
I think I've heard multiple times that a large % of training compute for SoTA models is actually inference used to generate training tokens; that's bound to happen with RL training.
They can run any number of inference experiments. Like a lot of the alignment work they have going on.
I am not saying this would be a great use of their compute, but idle is far from the only alternative. (Unless electricity is the binding constraint?)
Electricity is charged whether you use it or not, so very unlikely, but sure, they can find uses for it. Although they're not going to make that much money compared to Claude Code subscriptions.
> Electricity is charged whenever you use it or not, [...]
Huh, what? You know you can turn off unused equipment, and at least my Nvidia GPU can draw more or fewer watts even while powered on?
Or does Anthropic have a flatline deal for electricity and cooling?