we could use some composability.

today any kind of routing requires implementing an http proxy to put in the middle

ideally harnesses would support a routing plugin which receives the new whole context and returns just where to send it, and the harness does that. no http proxy. obviously some complications if you want to route from codex to anthropic or openrouter.

but we need to decouple the context building and routing decision from the actual http requests sending, we need to be able to insert "context/routing plugins" in the chain

Mine does this. Only I don’t use the whole context with the router because that’s wildly resource intensive and slow.

Then, a bit like open router, it does a classifier job with a fast model to choose which one should process the turn.

In my case I usually don’t do local vs remote… although it can. Now I use it for thinking vs no-think against my preferred local model, which is a huge time saver even with the added classification step.

I'm still waiting for an isolated protocol so we don't have to run the hanress directly on any of the code base's infrastructure. Something as simple as piping everything into and out of an ssh shell would be better than anything I've tested so far.

i've created the protocol, role-model: https://github.com/try-works/role-model

Great name, but ironically hard to reason about from a role perspective, at least at the read me.

Does this interfere with cache hits? Could a single conversation or task span multiple roles?

Why are you building this? Does this maximize my toxen value by saving the hard tasks for the hard model? Does it maximize cache hits as part of its scoring? Does it help agents develop a specialist mindset? Are you anticipating users will have many local models hot, or is this also a model load/unload controller?

I'm building this to achieve a state where I can, as a user and on my own device, decide that certain type of workloads should be handled by my Qwen model and keep the data on my device, while other workloads should be handled by more capable models.

For this we don't just need a router, because the information to make detailed and accurate routing decisions currently doesn't exist. And there are no standards but every lab and maybe even inference providers have their own way of implementing reasoning, chat templates, cache, tool use and so on. All issues that make models non-interoperable.

What we need is applications that clearly specify their requests so they can be accurately routed to a provider, whether local or remote. And for that they need to use a standard protocol for model requests and intent.

I wrote a longer piece here: https://news.ycombinator.com/item?id=48706181