Hacker News

Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.

So, a conversation that's ongoing with one model then switching to another would presumably send the whole conversation and the new question. Which defeats the purpose of splitting traffic...so, you're not wrong to question how this actually improves things for anything other than short sessions, which you could choose your own model for if it's a small problem.