Useful tool, but the "can I run it?" question obscures the more important one: "should I run it locally for my use case?"
For interactive chat and simple Q&A, local models are great — latency is predictable, privacy is absolute, and the quality gap with frontier models is narrowing for straightforward tasks. A quantized Llama running on an M-series Mac is genuinely useful.
But for agentic workflows — where the model needs to plan multi-step tasks, use tools, recover from errors, and maintain coherence across long interactions — the gap between local and frontier models is still enormous. I have seen local models confidently execute plans that make no sense, fail to recover from tool errors, and lose track of what they are doing after a few steps. Frontier models do this too sometimes, but at a much lower rate.
The practical middle ground I see working well: local models for fast, cheap tasks like commit message generation, code completion, and simple classification. Frontier API models for anything requiring planning, reasoning over large contexts, or reliability. The economics favor this split — running a local model costs electricity and GPU memory, while API calls cost per token. For high-volume low-complexity tasks, local wins. For low-volume high-complexity tasks, APIs win.
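The cost split above can be made concrete with a toy router. Everything here is an illustrative assumption, not a benchmark: the wattage, electricity rate, and API price are placeholder numbers, and `route` is a hypothetical helper, not any real library's API. The point is just that marginal local cost scales with GPU time while API cost scales with tokens, so the cheaper backend flips depending on the task.

```python
# Toy break-even sketch. All numbers (300 W GPU draw, $0.15/kWh,
# $5 per million tokens) are illustrative assumptions, not measurements.

def local_cost_per_task(seconds_per_task: float,
                        gpu_watts: float = 300.0,
                        usd_per_kwh: float = 0.15) -> float:
    """Marginal electricity cost of running one task on a local GPU."""
    kwh = gpu_watts * seconds_per_task / 3_600_000  # W * s -> kWh
    return kwh * usd_per_kwh

def api_cost_per_task(tokens_per_task: int,
                      usd_per_million_tokens: float = 5.0) -> float:
    """Cost of one task at a blended per-token API rate."""
    return tokens_per_task / 1_000_000 * usd_per_million_tokens

def route(task_tokens: int, task_seconds: float, needs_planning: bool) -> str:
    """Hypothetical router: frontier API for anything needing planning or
    reliability, otherwise whichever backend is cheaper per task."""
    if needs_planning:
        return "api"
    if local_cost_per_task(task_seconds) <= api_cost_per_task(task_tokens):
        return "local"
    return "api"
```

Under these made-up numbers, a 2-second, 500-token commit-message task costs a fraction of a cent either way, but the local run is roughly two orders of magnitude cheaper, which is why high-volume low-complexity work favors local. The `needs_planning` flag is doing the real work: it encodes the quality gap that no per-token price captures.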