Since Gemma 4 came out this Easter, the gap between self-hosted models and Claude has decreased significantly, I think. The gap is still huge; it's just that local models were extremely non-competitive before Easter. Now it seems Qwen 3.6 is another bump up from Gemma 4, which is exciting if true. I keep an Opus close of course, because these local models still wander off in the wrong direction and fail, something Opus almost never does for me anymore.

But every time a local model gets me by, I feel closer to where I should be: writing code should still be free, both free as in free beer and free as in freedom.

My setup is a separate dedicated Ubuntu machine with an RTX 5090. Qwen 3.6:27b is using 29/32 GB of VRAM as it works right this minute. I run Ollama in a non-root Podman instance, and I use OpenCode as an ACP service for my editor, which I highly recommend. ACP (Agent Client Protocol) is how the world should be, in case you were asking, which you didn't :)

Exciting times and thank you Qwen team for making the world a better place in a world of Sam Altmans.

>> I feel closer to where I should be; writing code should still be free. Both free as in free beer, and free as in freedom.

I'm just pleased by the competition. I agree with the ideal of free and local, but sustainable competition is key: driving $200/month down to a much, much lower number.

Gemma 4 feels the most "Claude-like" of all the models I've run locally on my M5 MBP.

I found on coding tasks that Qwen 3.5 can actually do the thing, whereas Gemma 4 went off the rails frequently. Will try this new 3.6 release today.

I use Qwen 3.5 122B on an RTX PRO 6000 with OpenCode, and I'm very pleased. I don't feel a need to use a closed model any more. The result after answering questions in Plan mode is almost always what I want, with very few occasional bugs. It puts a lot of effort into seeing how the code I am working on is currently written, and extends it in the same style.

If they release a Qwen 3.6 that also makes good use of the card, I may move to it.

There was a Qwen 3.6 MoE six days ago that I thought was better than Gemma 4. Today's is a dense model. (Gemma released both a 26B MoE and a 31B dense at the same time.)

I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again.

What level of programming tasks can a 27B model handle? Even with Claude, I'm occasionally not satisfied, and I can't imagine how effective a 27B model would be.

I ran 3 prompts (short versions here; full versions in the repo):

- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.

- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).

- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.

and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5, with Kimi K2.6 as the judge (imperfect, I know; it was 2 AM). Qwen surpassed MiniMax and won 2/3 of the implementations against GLM-5 according to Kimi K2.6, which still sounds insane to me. The env was a pi-mono with basic tools plus a websearch tool pointing at my SearXNG (I don't think any of the models used it), with a slightly customized, shorter system prompt. TurboQuant was at 4-bit during all Qwen tests. Full results: https://github.com/sleepyeldrazi/llm_programming_tests.
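For reference, the first of those prompts (a numerically stable layer-norm backward pass in NumPy) can be sketched like this. This is my own minimal sketch of the standard derivation, not any model's output:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize over the last axis; keep what the backward pass needs.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    return gamma * xhat + beta, (xhat, inv_std, gamma)

def layernorm_backward(dy, cache):
    # Gradients w.r.t. x, gamma, beta. The dx expression folds the
    # mean/variance chain rule into one term whose row sums cancel
    # exactly, which is where the numerical stability comes from.
    xhat, inv_std, gamma = cache
    n = xhat.shape[-1]
    batch_axes = tuple(range(dy.ndim - 1))
    dgamma = (dy * xhat).sum(axis=batch_axes)
    dbeta = dy.sum(axis=batch_axes)
    dxhat = dy * gamma
    dx = inv_std / n * (
        n * dxhat
        - dxhat.sum(axis=-1, keepdims=True)
        - xhat * (dxhat * xhat).sum(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta
```

A finite-difference gradient check is the easiest way to judge whether a model's attempt at this is actually correct, and it's what I'd look for first in the transcripts.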

I am also periodically testing small models on a https://www.whichai.dev style task to see their designs, and Qwen3.6 27B also obliterated (IMO) the other ones I tested: https://github.com/sleepyeldrazi/llm-design-showcase

Needless to say, those tests are non-exhaustive and have flaws, but the trend from the official benchmarks looks like it is being confirmed in my testing. If only it were a little faster on my 3090; we'll see how it performs once a DFlash for it drops.

Basic triage is good. I've found I still need to handle most of the programming myself, but local models have been good for pointing me at where to look, with just "investigate https://github.com/HarbourMasters/Shipwright/issues/6232" as the prompt.

> Qwen 3.6:27b uses 29/32gb of vram

What context size are you using for that?

Btw, are you using flash attention in Ollama for this model? I think it's required for this model to operate correctly.

I squeezed it into 24 GiB of VRAM (since I have an RX 7900 XTX):

-- Q5_K_M Unsloth quantization on Linux llama.cpp

-- context 81k, flash attention on, 8-bit K/V caches

-- pp 625 t/s, tg 30 t/s

Depends entirely on quantization. Q6_K with max context length (262144) is ~40 GB of VRAM.

Q8 with the same context wouldn't fit in 48 GB of VRAM; it did fit with 128k of context.
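For intuition on why context length and cache precision dominate these numbers, here's a back-of-the-envelope KV-cache estimate. The layer/head/dim figures below are assumed for a generic 27B-class GQA model, not taken from Qwen's actual config, and real quant formats like q8_0/q4_0 carry a little extra block-scale overhead on top:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem):
    # 2 tensors (K and V) per layer; each token stores one
    # n_kv_heads * head_dim vector per tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

GIB = 1024 ** 3
# Hypothetical 27B-class GQA shape: 48 layers, 8 KV heads, head_dim 128.
for name, b in [("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)]:
    gib = kv_cache_bytes(48, 8, 128, 131072, b) / GIB
    print(f"{name}: {gib:.1f} GiB of KV cache at 128k context")
```

Under those assumptions the cache alone swings from ~24 GiB at f16 down to ~6 GiB at 4-bit, which is why an 8-bit or 4-bit KV cache is often the difference between fitting a long context and not.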

How many tokens/s do you get on RTX 5090?

I set this up today on my 5090 at Q6_K quantization with a Q4_0 KV cache, and got 50 tokens/s consistently at 123k context, using ~28/32 GB of VRAM through LM Studio.

Wow, that sounds usable. I know it's anecdotal, but how did you find the quality of the output, and can you compare it to any closed-source model?

Not that you asked, but I'm getting ~20 tokens/s on my DGX Spark (Asus, actually) using an Int4 AutoRound quant, MTP 1, and some other tricks.

Can't answer for an RTX 5090, but on an RTX 5080 with 16 GB of VRAM (desktop), I get about 6 tokens/sec after some tweaking (f16 -> q4_0). Kind of on the borderline of usable... realistically you probably need either a 5090 with more RAM or something like a Mac with a unified memory architecture.

My M5 Pro is getting ~11 tokens per second via OMLX for an 8 bit quant.

A Mac is not going to be all that much faster than a 5080 with any model, other than the ones you can't currently run at all because you don't have enough GPU+CPU memory combined.

You’re much better off adding a second GPU if you’ve already got a PC you’re using.