Looks like we are seeing small-but-mighty model breakthroughs outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not good enough to capture success in real developer workflows?
I think people are going to continue to be surprised by the capability of small models.
Now, if you ask this model to have a conversation with you, it's gonna fail and be incoherent. But boy, does it sure reason through math problems well.
I just started using qwen3.6:35b a couple of days ago, running on my Framework Desktop, and I'm rather impressed. It runs really well and reminds me of probably the first Claude model I used. It's the first local model I've tried that actually works for me in a coding agent. Very exciting!
Try 27b, it's significantly smarter than 35b-a3b (although it is slower, it's not so bad with MTP).
It is, but it's way too slow on a Strix Halo due to its limited bandwidth.
(I'm still sad that they didn't make a 122B-A10B version of it, as it's the kind of model that fits best on a Strix Halo, and for 3.5 it was comparable in performance to the dense 27B version).
Yeah, the speed is vastly different, but it's getting ~10 tps, and the MoE model is like 50 or something. I might use it if it proves to be much smarter; I don't really monitor my agents while they're running.
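For anyone wondering why the dense model is so much slower on this kind of hardware, here is a back-of-the-envelope sketch. It assumes decoding is memory-bandwidth bound, so each generated token has to stream the active weights from RAM once. The 256 GB/s bandwidth figure and the 4.5 bits per weight are my assumptions, not numbers from this thread, and the quants people are actually running here may differ.

```python
# Rough ceiling on decode speed when generation is memory-bandwidth bound:
# every generated token has to read the active weights from RAM once.

def max_tokens_per_second(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Upper bound in tokens/s; ignores KV cache reads, compute, and overhead."""
    active_gb = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / active_gb

# Assumptions (not from the thread): ~256 GB/s theoretical bandwidth and
# ~4.5 bits per weight for a 4-bit-class quant.
BANDWIDTH_GB_S = 256
BITS_PER_WEIGHT = 4.5

dense_27b = max_tokens_per_second(27, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
moe_a3b = max_tokens_per_second(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"27B dense ceiling: {dense_27b:.1f} tok/s")
print(f"35B-A3B ceiling:   {moe_a3b:.1f} tok/s")
print(f"ratio:             {moe_a3b / dense_27b:.1f}x")
```

These are ceilings, not predictions. Real numbers like the ~10 and ~50 tps mentioned above land below them because of KV cache reads, layers that are always active, and runtime overhead.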
At least according to gertlabs, Qwen3.6 27B outperforms every SoTA (closed) model at Kotlin: https://archive.vn/RYBCL / https://gertlabs.com/rankings?mode=agentic_coding&language=k...
Interesting. I wonder if there is an opportunity to train a set of small model variants that each excel at a certain stack, e.g. Qwen3.6-27B for Node + React or Qwen3.6-27B for Rust + TUI.
This is always how I've imagined small/consumer-hardware models going in time. If I only ever code in Python, give me a model that does just that (plus some general CS, algorithms, structure, etc.) and does it super-fast and well. Make it small enough that if I need a Python back end and an HTML front end, another specific model can load alongside and collaborate on the front end.
Or give me a pure shopping model that has a general understanding of products and product categories, and then will playwright/scrape/API into shopping sites to compare options and find me what I want. Etc.
Qwen 3.6 27B is an anomalously strong all-around model for its size, but when we run our evaluations, we generate 10 coding submissions/language/model (110 total). So, full disclosure: the per-language, per-model performances can be noisy (I do not think Qwen3.6 27B is better than Fable 5 in agentic workflows when writing Kotlin, given enough samples, although we do find some interesting anomalies that hold up under large sample sizes).
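To put a rough number on how noisy 10 submissions can be, here is a small sketch using the Wilson score interval. It assumes each submission is simply pass/fail, which may not match how the rankings are actually scored, and the pass counts are made up for illustration.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical pass counts, for illustration only (not real benchmark data).
for name, passed in [("model A", 8), ("model B", 6)]:
    low, high = wilson_interval(passed, 10)
    print(f"{name}: {passed}/10 passed, 95% interval {low:.2f} to {high:.2f}")
```

With only 10 samples the two intervals overlap heavily, so an 8/10 versus 6/10 gap on a single language says little about which model is really better there.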
Hmm, I just assumed bigger was better. How's it different?
Off the top of my head since it seems to be the quick info you're looking for: IIRC, with these two, the 27B is a dense model, meaning it's all active at inference. Meanwhile, the 35B is a Mixture of Experts (MoE), so only part of its network (3B?) is active at any time.
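If it helps to see the idea, here is a toy sketch of top-k expert routing, the usual mechanism behind "only part of the network is active". Everything in it is hypothetical: the experts are plain scalar functions and the router scores are random, whereas a real model computes them with learned layers, and I am not claiming this matches Qwen's actual architecture or expert counts.

```python
import math
import random

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    exps = [math.exp(router_logits[i] - router_logits[top[0]]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    routed = top_k_route(router_logits, k)
    output = sum(w * experts[i](x) for i, w in routed)
    return output, [i for i, _ in routed]

# Toy setup: 8 "experts" that just scale their input (hypothetical).
NUM_EXPERTS, K = 8, 2
experts = [lambda x, s=s: s * x for s in range(1, NUM_EXPERTS + 1)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
y, used = moe_layer(1.0, experts, logits, K)
print(f"experts used: {used} ({K} of {NUM_EXPERTS} active), output {y:.3f}")
```

The other six experts are never called for this input, which is why an MoE model can have many total parameters while doing far less work per token than a dense model of the same size.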
Thanks! Dense models have been slow on my compute, but I'll give it a try. If it's not toooooo slow then it's fine; I mostly fire-and-forget agents anyway.
Edit: seems fast! I'll try it out some more, thanks again.
I'm running qwen36.:35b:iq4 IQ4_XS quant. Takes 18 GB of RAM with 131k context window. Seems to be really good. Have it running local stuff via Hermes, using a cloud model via Ollama (Deepseek V4-Pro) for heavy lifting.
If your Framework Desktop is the 128GB Strix Halo, I recommend giving Qwen 3.5 122B-A10B a shot.
This Q5_K_M quant should be near lossless and fit with full 256K context in about 100GB of RAM: https://huggingface.co/AesSedai/Qwen3.5-122B-A10B-GGUF
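A quick way to sanity-check RAM figures like these is parameters times bits per weight. The sketch below uses approximate average bits per weight for each quant type; treat those values as my assumptions, since real GGUF files vary by a few percent depending on which tensors are kept at higher precision.

```python
def weights_gb(total_params_b, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8

# Approximate average bits per weight (assumed values, not exact file sizes).
QUANTS = {"IQ4_XS": 4.25, "Q5_K_M": 5.5}

small = weights_gb(35, QUANTS["IQ4_XS"])
large = weights_gb(122, QUANTS["Q5_K_M"])

print(f"35B  @ IQ4_XS: ~{small:.1f} GB of weights")
print(f"122B @ Q5_K_M: ~{large:.1f} GB of weights")
```

The KV cache and runtime buffers come on top of the weights and grow with context length, which is roughly how a 122B model at Q5_K_M ends up near the 100GB mark with a long context.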
3.6 scores better on coding across the board.
Edit: specifically Qwen 3.6 27B beats that on coding and agentic workflows.
I'll keep this in mind.
Could you please share which coding agent you are using with it?
Crush: https://github.com/charmbracelet/crush/
The Q8_K_XL MTP model from Unsloth: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF
I settled on opencode after trying goose and aider as well. I'll probably try some more, but opencode works similarly to Claude Code, which is my main agent.
I serve the model with ollama and am thinking about replacing ollama but haven't looked into it.
I have openwebui for chat if I want that too, but don't really use it.
npx @oh-my-pi/pi-coding-agent
I am using Mistral Vibe.
Pi
It feels sometimes like optimizations are only starting.
I’m beginning to suspect the closed SOTA labs were doing all these optimisations, keeping quiet about it, and just charging us out the yinyang for inference.