While the benchmarks all say open-source models like Kimi and Qwen outpace proprietary models like GPT-4.1, GPT-4o, or even o3, my (and just about everyone I know's) boots-on-the-ground experience suggests they're not even close. This is for tool-calling agentic tasks, like coding, but also in other contexts (research, glue between services, etc.). I feel like it's worth putting that out there: it's pretty clear there's a lot of benchmark hacking happening. I'm not really convinced it's purposeful or deceitful, but it's definitely happening. Qwen3 Coder, for example, is basically incompetent at any real coding task and frequently gets caught in death spirals of bad tool calls. I try all the OSS models regularly, because I'm really excited for them to get better. Right now Kimi K2 is the most usable one, and I'd rate it a few ticks worse than GPT-4.1.

Isn't the problem with the benchmarks that most people running AI locally are running much smaller models?

I have an M4 Studio with a lot of unified memory and I'm still nowhere near running a 120B model. I'm at like 30B.

Apple or Nvidia is going to have to sell 1.5 TB RAM machines before benchmark performance is going to be comparable.

Plus, when you use Claude or OpenAI these days, it's performing Google searches etc. that my local model isn't doing.

In my case, I'm paying for inference on the original models from e.g. Fireworks, so it's not a quantization problem. The Qwen3 I was using was the new 480B Qwen3-Coder model, their top performer for code.

I agree with other comments that there are productive uses for them, just not on the scale of o4-mini/o3/Claude 4 Sonnet/Opus.

So IMO, larger open-weights models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size; they're just quite small.

No, I've deployed a lot of open-weight models and the gap versus closed-source models is there even at larger sizes.

I'm running a 400B-parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.

I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.

GPT-3.5 has better world knowledge than some 70B models, and even a few larger ones.

The big "frontier" models are expert systems built on top of the LLM. That's the reason for the massive payouts to scientists: it's not about some ML secret sauce, it's about all the symbolic logic they bring to the table.

Without constantly refreshing the underlying LLM and the expert-system layer, these models would be outdated in months. Language and the underlying reality would shift out from under their representations and they would rot quickly.

That's my reasoning for considering this a bubble. There has been zero indication that the R&D can be frozen. They are stuck burning increasing amounts of cash for as long as they want these models to be relevant and useful.

you're killing my dream of blowing $50-100k on a desktop supercomputer next year and being able to do everything locally ;)

"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.

Take a look at: https://www.nvidia.com/en-us/products/workstations/dgx-spark... . IIRC, it was about $4K.

Given that a non-quantized 700B monolithic model with, let's say, a 1M-token context would need around 20 TB of memory, I doubt your Spark or M4 will get very far.

I'm not saying those machines can't be useful or fun, but they're not in the range of the 'fantasy' thing you're responding to.
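
For context, here's a rough back-of-envelope version of that estimate. The layer count and hidden size are assumed, generic dense-transformer numbers, not the specs of any particular model:

    # Back-of-envelope memory estimate for a dense ~700B model at FP16
    # with a very long context. Shape numbers below are assumptions.

    PARAMS = 700e9          # total parameters
    BYTES_PER_PARAM = 2     # FP16/BF16, i.e. "non-quantized"

    N_LAYERS = 120          # assumed transformer depth
    HIDDEN = 16_384         # assumed hidden size
    CONTEXT = 1_000_000     # 1M-token context
    KV_BYTES = 2            # FP16 KV cache

    weights_tb = PARAMS * BYTES_PER_PARAM / 1e12

    # KV cache: K and V per layer, each hidden-sized, stored for every token.
    kv_per_token = 2 * N_LAYERS * HIDDEN * KV_BYTES              # bytes
    kv_total_tb = kv_per_token * CONTEXT / 1e12

    print(f"weights:  ~{weights_tb:.1f} TB")    # ~1.4 TB
    print(f"KV cache: ~{kv_total_tb:.1f} TB")   # ~7.9 TB
    print(f"total:    ~{weights_tb + kv_total_tb:.1f} TB before overhead")

Serving more than one request at a time multiplies the KV-cache term, and activation memory and framework overhead add more, which is how you land at numbers like 20 TB; quantization and KV tricks like GQA/MLA pull it down, but not to desktop scale.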

I regularly use Gemini CLI and Claude Code, and I'm convinced that Gemini's enormous context window isn't that helpful in many situations. The more you put into context, the more likely the model is to go off on a tangent and end up with "context rot", or to get confused and start working from an older, no-longer-relevant context. You definitely need to manage and clear your context window, and the only time I would want such a large one is when the source data really is that large.

An M4 Max has twice the memory bandwidth (which is typically the limiting factor).

I'll say neither of them will do anything for you if you're currently using SOTA closed models in anger and expect that performance to hold.

I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.

I'm running an M4 Max as well, and I found that Goose works decently well with qwen3-coder loaded in LM Studio (Ollama doesn't do MLX yet unless you build it yourself, I think) and configured as an OpenAI model, since the API is compatible. Goose adds a bunch of tools and plugins that make the model more effective.
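
For anyone who wants to reproduce the LM Studio half of that setup without Goose: LM Studio exposes an OpenAI-compatible server (on localhost:1234 by default), so any standard OpenAI client can talk to the local model. A minimal sketch, assuming the model is already loaded and the server started; the model identifier is whatever LM Studio lists for your download:

    # pip install openai -- talks to LM Studio's OpenAI-compatible local server
    from openai import OpenAI

    # LM Studio's local server defaults to this address; the api_key is unused locally.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="qwen3-coder-30b",  # use whatever identifier LM Studio shows for your model
        messages=[
            {"role": "system", "content": "You are a careful coding assistant."},
            {"role": "user", "content": "Write a function that reverses a linked list in Python."},
        ],
        temperature=0.2,
    )
    print(resp.choices[0].message.content)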

It will be sort of decent on a 4-bit 70B-parameter model, like here: https://www.youtube.com/watch?v=5ktS0aG3SMc (deepseek-r1:70b Q4_K_M). But yeah, not great.

I'm so darn confused about local LLMs and M-series inference speed; the perf jump from M2 Max to M4 Max was negligible, 10-20% (both times an MBP with 64 GB and max GPU cores).

Does your inference framework target the NPU or just GPU/CPU?

It's linking llama.cpp and using Metal, so I presume GPU/CPU only.

I'm more than a bit overwhelmed with what I've got on my plate and have completely missed the boat on, e.g., understanding what MLX is. Really curious for a thought dump if you have some opinionated experience/thoughts here (e.g., it never crossed my mind until now that you might get better results on the NPU than the GPU).
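
For what it's worth, MLX is Apple's array/ML framework; it runs on the GPU via Metal rather than the NPU (ANE), and the mlx-lm package is the usual way to run MLX-converted LLM weights. A minimal sketch, with the caveats that the model repo named here is just an example and the generate() kwargs have shifted a bit across mlx-lm versions:

    # pip install mlx-lm -- Apple-silicon inference via MLX (Metal/GPU, not the ANE)
    from mlx_lm import load, generate

    # Any MLX-converted checkpoint works here; this repo name is illustrative,
    # not a recommendation.
    model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

    text = generate(
        model,
        tokenizer,
        prompt="Explain the difference between MLX and llama.cpp in two sentences.",
        max_tokens=200,
    )
    print(text)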

It may be the way I use it, but qwen3-coder (30B with Ollama) is actually helping me with real-world tasks. It's a bit worse than the big models for the way I use it, but absolutely useful. I do use AI tools with very specific instructions though, like file paths, line numbers if I can, specific direction about what to do, my own tools, etc. (see the sketch below), so that may be why I don't see such a huge difference from the big models.

I should try Kimi K2 too.
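
As a concrete illustration of that "very specific instructions" style, here's a minimal sketch using the Ollama Python client; the file path, line range, and task are invented for the example, and the model tag should be whatever you actually pulled:

    # pip install ollama -- client for a local Ollama server
    import ollama

    # Tightly scoped prompt: exact file, exact lines, exact change, exact output format.
    prompt = (
        "Edit src/parser/tokenizer.py, lines 42-57 only.\n"
        "Replace the manual whitespace-skipping while-loop with str.lstrip().\n"
        "Do not touch any other function.\n"
        "Return only the new lines 42-57 as a unified diff."
    )

    resp = ollama.chat(
        model="qwen3-coder:30b",   # whatever tag you have pulled locally
        messages=[{"role": "user", "content": prompt}],
    )
    print(resp["message"]["content"])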

It has everything to do with the way you use it. And the biggest difference is how fast the model/service can process context. Everything is context. It's the difference between iterating on an LLM-boosted goal for an hour vs. 5 minutes. If your workflow involves chatting with an LLM and manually passing chunks, manually retrieving the response, manually inserting it, and manually testing...

You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.

Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking about the type of workflow where you push over a million tokens in a few minutes, over several steps.

That's where the moat, no, the chasm, between local setups and a public API lies.

No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.

That's where local models fold.
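
To make that concrete, here is a sketch of the kind of automated, context-heavy loop being described: each step re-sends a ~100k-token slice of a repo plus accumulated notes, so a handful of steps adds up to a million-plus tokens. The directory, model id, and prompts are placeholders:

    # Sketch of an automated, context-heavy agent loop. Endpoint, model name,
    # directory, and step prompts are placeholders, not a prescribed setup.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # point base_url at a local server to compare local vs. hosted

    # Pack a large slice of a repo into one context (~100k+ tokens; very roughly
    # 4 chars/token, so cap the blob at ~400k characters).
    blob = ""
    for path in sorted(Path("src").rglob("*.py")):
        blob += f"\n# ==== {path} ====\n{path.read_text()}"
        if len(blob) > 400_000:
            break

    steps = [
        "Map every module in this codebase and its dependencies.",
        "Using the map you produced, list the three riskiest refactors.",
        "Write a step-by-step plan for the top refactor, citing files and lines.",
    ]

    notes = ""
    for step in steps:  # each call re-sends the full context plus accumulated notes
        resp = client.chat.completions.create(
            model="gpt-4.1",  # or any OpenAI-compatible model id
            messages=[{
                "role": "user",
                "content": f"{step}\n\nPrior notes:\n{notes}\n\nCodebase:\n{blob}",
            }],
        )
        notes = resp.choices[0].message.content

    print(notes)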

You'll see good results. Kimi is basically a micro-dosing Sonnet, lol. Very, very reliable tool calls, but because it's micro-dosing, you don't wanna use it for implementing OAuth; it's better for adding comments or following strict direction (i.e., a series of text mutations).

Not sure about benchmarks, but I did use DeepSeek when it was novel and cool for a variety of tasks before going back to Claude, and in my experience it was OK: not significantly worse than the closed stuff at the time for what I use these models for (writing code a small function at a time, learning about libraries, etc.).

While that's true for some open-source models, I find DeepSeek R1 685B 0528 to be competitive with o3 in my production tests; I've been using it interchangeably for tasks I used to handle with Opus or o3.

I would have assumed anyone frequenting HN would have figured out by now that benchmarks are 100% bullshit. I guess I'd be wrong.

I think anyone frequenting HN and actually using these tools absolutely knows these benchmarks are 100% bullshit and the only real way to test these things is to just use them yourself.

Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.

So what do you propose? Gut feel, N=1 tests?

At the moment, the only way you can tell if the model is good for a particular task is by trying it at that task. Gut feel is how you pick the models to test first, and that is also based largely on past experience and educated guesses as to what strengths translate between tasks.

You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.

It currently beats them, depending on the benchmark.

I mean, in other environments people say that.

If you asked "What's the best bicycle?", most enthusiasts would say one you've tried, one that works for your use case, etc.

Benchmarks should be for pruning the models you try, at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public benchmark, generate a ton of synthetic examples, train on those, repeat).