I have really been trying to get local models to work. I have tried different harnesses, tooling, skills, prompts, etc. But when I compare Claude Code with Anthropic models, or Codex with GPT 5.5, against Qwen, GLM, or Gemma in the same harnesses, the frontier models come out massively ahead. I am at the point where I just don't see the point of the non-frontier models: they waste more time than they save.
The point is to not indenture yourself to a corporation whose motives do not align with your own.
Their motives are to make the best product to compete in a very competitive market.
Local models are 3 to 6 months behind SOTA models, with the huge benefit of not needing to send all your IP to a shady third party.
If inference cost keeps coming down (as it has for the last few years), you’ll be able to run today’s SOTA on your laptop by the end of the year.
"shady third party"
If Claude hosted on AWS Bedrock is not considered trustworthy, I have some bad news for you.
I would say that is highly unlikely if by SOTA models you mean not just coding benchmarks but also more general-purpose ability and domain-specific knowledge. For example, Kimi 2.6, which is comparable to Opus 4.6, is roughly 500+ GB, and I don't see how that would run on consumer hardware anytime soon. Besides, this is not just a question of technical feasibility; it is also not economically viable. Why should consumer laptops be capable of running such models, sitting massively underutilized most of the time, when inference providers can produce the same results faster and cheaper?
It runs right now on 512 GB RAM Macs and PCs.
It runs like shit though in terms of tokens/second, and still has a reduced context window. By contrast, a single Claude prompt can easily get to 300k tokens without breaking a sweat.
I want local AI to be a thing, but the hardware isn’t here yet: the only options are a Mac Studio or DGX machines strapped together. RAM prices need to crash before local AI has a chance at actually competing.
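A rough back-of-envelope sketch of why the context window shrinks in that setup: whatever RAM is left after loading the weights has to hold the KV cache. The 512 GB and ~500 GB figures come from the comments above; the per-token KV cache size and the OS reserve are hypothetical placeholders, not measured values for any particular model.

```python
def max_context_tokens(ram_gb, weights_gb, kv_bytes_per_token, reserve_gb=0.0):
    """Estimate how many tokens of KV cache fit in the RAM left over
    after the model weights and a reserve for the OS and runtime."""
    free_bytes = (ram_gb - weights_gb - reserve_gb) * 1e9
    if free_bytes <= 0:
        return 0  # weights alone do not fit
    return int(free_bytes // kv_bytes_per_token)


if __name__ == "__main__":
    # ram_gb and weights_gb: figures quoted in the thread.
    # kv_bytes_per_token and reserve_gb: illustrative assumptions only.
    tokens = max_context_tokens(
        ram_gb=512,
        weights_gb=500,
        kv_bytes_per_token=100_000,
        reserve_gb=8,
    )
    print(f"approx. tokens of context that fit: {tokens}")
```

Under these assumed numbers the leftover memory holds far fewer tokens than the 300k-token prompts mentioned above. Plug in real figures for your model and quantization to get a meaningful estimate.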
Because privacy has perceived value.
The bigger issue is context windows. HUGE difference there.
For agentic coding I 100% agree with you: local models are worse, slower, and more expensive for LARGE coding tasks. Narrow coding (like writing a specific function) is slow but viable. Regular LLM chat usage on high-end consumer hardware is competitive, except on cost. [0]
[0] https://www.williamangel.net/blog/2026/05/17/offline-llm-ene...
The hosted frontier models are massively subsidized, right? I think the point of local non-frontier models is just learning at this point, so you’ll be skilled if/when the market starts comparing the actual price of the two different models.
I came to the same conclusion. For the amount that a query costs, using Opus all the time is the cheapest option.