I expected to see OpenAI, Google, Anthropic, etc. ship desktop applications with integrated local utility models and sandboxed MCP functionality to cut unnecessary token usage and task round-trips, and I still expect this to happen at some point.

The biggest long-term risk to the AI giants' profitability will be increasingly capable desktop GPUs and CPUs combined with the improving performance of local models.

From experience, offloading context-scoping and routing decisions to smaller models just results in those models making bad judgments at very high speed.

Whenever I experiment with agent frameworks that spawn subagents with scoped subtasks and restricted context, things go off the rails very quickly. A subagent with reduced context makes poorer choices, hallucinates assumptions about the wider codebase, and very often lacks a basic sense of the point of the work. This lack of situational awareness is where you are most likely to find JS scripts suddenly appearing in your Python repo.
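To make the pattern concrete, here's a minimal sketch of the kind of scoped-subagent dispatch I mean. The model name, the call_model stub, and the keyword-based scoping heuristic are all hypothetical stand-ins for whatever framework/provider you use, not any particular API:

    def call_model(model: str, prompt: str) -> str:
        """Hypothetical LLM call -- swap in your provider's client."""
        raise NotImplementedError

    def scope_context(full_context: list[str], keywords: list[str]) -> list[str]:
        # Naive scoping: keep only the chunks that mention a keyword.
        # Anything not matched is invisible to the subagent -- this is
        # exactly where it loses the "point of the work".
        return [chunk for chunk in full_context if any(k in chunk for k in keywords)]

    def run_subtask(task: str, full_context: list[str], keywords: list[str]) -> str:
        scoped = scope_context(full_context, keywords)
        prompt = "Task: " + task + "\n\nRelevant context:\n" + "\n---\n".join(scoped)
        # The cheaper/faster model only ever sees the restricted view.
        return call_model("small-local-model", prompt)

The failure mode is baked into scope_context: the small model answers quickly and confidently from a slice of the repo that may have dropped the one file that explained what the project actually is.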

I don’t know if there is a “fix” for this or if I even want one. Perhaps the solution, in the limit, actually will be to just make the big-smart models faster and faster, so they can chew on the biggest and most comprehensive context possible, and use those exclusively.

eta: The big models have gotten better and better at longer-running tasks because they are less likely to make a stupid mistake that derails the work at any given moment. More nines of reliability, etc. By introducing dumber models into this workflow, and restricting the context that you feed to the big models, you are pushing things back in the wrong direction.

Yup. I expected a Google LLM to coordinate with many local expert LLMs that know the local tools, plus other domain-expert LLMs in the cloud.

I guess they don't see a viable path forward without specialty hardware.