Of course you can pass in your own state, but I always wondered about an LLM that has conversation context stay resident in GPU memory somehow.
Or maybe this already effectively covered by context caching and the gains would be minimal (stateless, but if you pass in the same context or the same head context, it’s already in GPU memory and doesn’t need to be loaded?).