That's almost exactly my setup and I'm very happy with its performance.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks, Qwen's hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve, it often just goes into a tool-calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things, making more and more of a mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers have had similar experiences?
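For anyone wondering what "catching a doom loop" could look like in practice, here is a minimal sketch. To be clear, this is not how the pi harness does it (I don't know its internals); `DoomLoopDetector`, `window` and `max_repeats` are names I made up for illustration. The idea is simply to fingerprint each tool call and bail out when the same call keeps recurring within a short window.

```python
import json
from collections import deque


class DoomLoopDetector:
    """Hypothetical helper: flags a run when one tool call keeps repeating."""

    def __init__(self, window: int = 8, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Fingerprints of the most recent calls; older ones fall off the end.
        self.recent = deque(maxlen=window)

    def record(self, tool_name: str, arguments: dict) -> bool:
        """Record one tool call; return True if it looks like a loop."""
        # Canonical JSON so that argument order does not change the fingerprint.
        fingerprint = (tool_name, json.dumps(arguments, sort_keys=True))
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.max_repeats
```

A harness would call `record(...)` before executing each tool call and abort (or ask the user) once it returns True. Exact-match fingerprints are the crudest possible check; a real harness probably needs something fuzzier.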
I keep finding more and more use cases for Q3.6 27b (same league), and it performs best when the answer to my question is already in the context.
The moment I try something open-ended or ambitious, Claude/ChatGPT clearly get me to the goal quicker.
For things where there's a way to build a knowledge base, though, the local LLM can definitely be a true contender. Plus, with a big context and no worries about filling it over and over, you can get quite far.
I'm writing this literally in between cooking pasta with ingredients that the local LLM ordered for me online. I've built a grocery shopping skill, so that it knows (loosely) what I have in the fridge, my last 10 representative orders (general preferences plus rich info about shops and SKUs around me) and actual real-time in-stock info. The last part has been my personal pet peeve with every product that promised cooking ingredient delivery (that is not packaged specifically for that).
This is what every big tech company with an agent has promised us, and now a local LLM has actually solved it for me fully.
I have said this before as well: these top-of-the-line models write clever, convoluted code. The code looks intelligent from above, but it is a maintenance headache. It makes the entire thing fragile for future development on top of it.
The smaller models, especially the aforementioned ones, fail much more often, but they do not write that kind of insane code. They do simple, non-clever coding like humans do. Much easier to maintain and build upon.
Qwen-3.6-27b is a wonderful model. Exceptionally good for its size, and excellent in general as well. And with MTP available now, it can run at 60+ tps on a single 3090... this is roughly 30% faster token generation than most of the hosted ones being served from giant data centers.
It's also going to fail consistently. When calling Claude, you don't know what version of the model you are talking to; it might be quantized due to load or have been patched.
This is true. The failure modes are simpler. And yes, the ceiling is lower as well. Smaller models' stability is lower over long sequences, and thus anything that needs a lot of CoT will be weaker. For example, I had a dumb lock + condvar with multiple defenses against lost wakeups in an N-producer, 1-consumer queue thing. Models generally need a lot of CoT before they realise they can switch it to a semaphore instead. Qwen typically isn't stable over such long CoTs and ends up adding more and more slop and band-aids, versus a larger model that outputs a large CoT and then realises it can swap 3 functions out for 2 lines if we use a semaphore.
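To make the semaphore point concrete, here is a rough Python sketch of the idea. It is not the original code (that isn't shown here); `SemaphoreQueue` and its method names are made up for illustration, and it assumes an unbounded queue with exactly one consumer. The semaphore counts queued items, so a `release()` that happens before the consumer starts waiting is remembered rather than lost, which is the whole reason the lost-wakeup defenses go away.

```python
import threading
from collections import deque


class SemaphoreQueue:
    """Illustrative unbounded N-producer, 1-consumer queue."""

    def __init__(self):
        self._items = deque()
        # Counts queued items; starts at 0 so get() blocks on an empty queue.
        self._available = threading.Semaphore(0)

    def put(self, item):
        # deque.append is documented as thread-safe, so producers need no lock.
        self._items.append(item)
        # Unlike a condvar notify, this increment persists with no waiter.
        self._available.release()

    def get(self):
        # Blocks until at least one item has been published.
        self._available.acquire()
        return self._items.popleft()
```

The append has to come before the `release()`, otherwise the consumer could wake up and find nothing to pop. A bounded queue would need a second semaphore counting free slots, which this sketch leaves out.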