This would not work in the way that shows any significant genuine benefit IMO. Caching and optimum routing of a single request are at odds with each other. Higher the distinct model count in a conversation, more cache misses you accept.
Based on what OP said elsewhere in the discussion "threshold to switch to another model will be higher" means that essentially you reduce the workflow into two models at most. The two model primitive, one planner and one executor, is already sufficient for such a use case.
For lower than 2 models, it's just a simple single model cache-preserving conversation which arguably doesn't need another layer. For larger than 2 models, you are likely paying a large aggregate cache penalty that negates most of the gains
When we started building this we did it as an experiment and we thought the same thing might be true (cache misses would make the whole thing pointless). This turned out not to be true! I think there are 3 reasons intuitively:
1. Small models can carry out a good number of requests e2e 2. Small model for part of a request + cache miss < big model for entire request in many cases 3. Subagents
For our own usage we've saved 40% so far (that is of course including costs of uncached requests when switching models)
This assumes a perfect problem routing though. Determining the complexity class of an arbitrary problem is generally undecidable or extremely hard (Rice's theorem implication). So, in real use cases, you need to amortize all cases where the problem got routed to the wrong model and recovery had to be performed)
For example, if my task was "refactor this component to decouple all messy nesting", the problem router can't possibly know what is being referred to. This works for clear cut and dry problems but not for ambiguous problems. Most of the real world problems carry a lot of ambiguity.
I think the key detail here is that we use embeddings of the prompt + previous context in order to decide where to route the request, and if one model is getting stuck we can course correct and move to a different model.
So: we can reasonably cluster similar problems together and learn how models handle them, and the entire system doesn't fail if the initial decision is off.
In my mind one of the problems is that I'm using the term 'router' to describe something more akin to a train schedule. A list of abilities, cost, and timeframe to be used by a model capable of deconstructing its own process.