In a chat setting you hit the cache every time you add a new prompt: all historical question/answer pairs are part of the context and don’t need to be prefilled again.
On the API side, imagine you are doing document processing and have a 50k-token instruction prompt that you reuse for every document.
It’s extremely viable and used all the time.
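To make the document-processing case concrete, here's roughly what it looks like with an API that exposes explicit prefix caching. This is a minimal sketch using the Anthropic Python SDK's cache_control marker as one example of the pattern; the model name and the instructions.txt path are placeholders, and details like minimum cacheable prefix length vary by provider:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the env

# The large, reusable instruction prompt (placeholder path).
INSTRUCTIONS = open("instructions.txt").read()

def process(document: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": INSTRUCTIONS,
                # Marks this prefix as cacheable: subsequent requests that
                # share the exact same prefix skip prefilling it and are
                # billed at the cheaper cache-read rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the per-document suffix changes between calls.
        messages=[{"role": "user", "content": document}],
    )
    return response.content[0].text
```

The chat setting works the same way: each new turn shares the entire prior conversation as the cached prefix, so only the newest message needs prefilling.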
I’m shocked that this hasn’t been a thing from the start. That seems like table stakes for automating repetitive tasks.
It has been a thing. Within a single request, the same KV cache is reused for every forward pass during decoding.
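Mechanically, that within-request reuse works like this: prefill runs one forward pass over the whole prompt to build the per-layer key/value tensors, and each decode step then feeds only the newest token against that cache. A minimal sketch with HuggingFace transformers and gpt2, chosen purely because it's small and runnable, not because production serving stacks decode this way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("The cache is reused because", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over the full prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(20):
        # Decode: feed only the newest token; keys/values for every
        # earlier token come from the cache instead of being recomputed.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

Cross-request prompt caching is the same idea lifted a level up: persist that prefill state and key it by prefix so later requests can start from it.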
It took a while for companies to start metering it and charging accordingly.
Also, companies have invested in hierarchical caches that allow longer-term and cross-cluster caching.