Hacker News

I'm not sure I understand what this is trying to solve?

If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?

Otherwise this would only be useful for one-off prompts as far as I can tell.

And if it did keep a context to be passed around, it would always land hot (not in the cache).

SwellJoe 11 hours ago [ - ]

Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.

So, a conversation that's ongoing with one model then switching to another would presumably send the whole conversation and the new question. Which defeats the purpose of splitting traffic...so, you're not wrong to question how this actually improves things for anything other than short sessions, which you could choose your own model for if it's a small problem.

nok22kon 9 hours ago [ - ]

sending the whole conversation to a cheap model could still be cheaper than sending just the latest message to the expensive one

you could even take this into account automatically to help decide

try-working 10 hours ago [ - ]

Here's a use case: You want to extend the GPT 5.5 quota in you Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to route where, for the appropriate difficulty level.

Another use case: You have two models on your local device. One is large and fairly powerful but low, the other is smaller, faster and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience with preserved performance.

The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.

nok22kon 9 hours ago [ - ]

to add to that, for example at the end of implementing a task, where the model runs the formatters, linters, tests, commits, pushes, this could be done by a very cheap model, and only switch to the main model again if something fails hard

there are some cache-busting considerations, but solvable

try-working 9 hours ago [ - ]

indeed. i also wrote elsewhere that the current ideal number of models in a pool is probably 2, so if you route between two both will have warm cache, though not the full cache at all times, so you lose a little but not much.

10 hours ago [ - ]

[deleted]

holoduke 9 hours ago [ - ]

LLMs have no state. There is nothing remembered, nothing new learned. It's the same input , the same output always (unless seeding is randomized). So during a chat it won't matter if every chat turn a different provider is used.

spiderfarmer 11 hours ago [ - ]

I'm not sure if output of easy commands like "summarize this" are added back to the context? I always assumed they are in a separate UI layer?