This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?
I agree.. Maybe parts of the cache contents are business secrets.. But then store a server side encrypted version on the users disk so that it can be resumed without wasting 900k tokens?
Disk where? LLM requests are routed dynamically. You might not even land in the same data center.
But if you have a tiered cache, then waiting several seconds / minutes is still preferable to getting a cache miss. I suspect the larger problem is the amount of tinkering they are doing with the model makes that not viable.