Yep, I daily drive Qwen3.6-27B (including for work), and have done pretty much since it came out. IMO it's the only (small-ish, local) model worth using, if you can run it. It might not be as good as Opus at "add X large feature" but I don't want that in a model. I want to do the thinking while it does the typing. And Qwen 3.6 27B is perfectly good at that (while in my experience models like the 35A3B and Gemma are significant downgrades).
Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
Running on 2x 3090, 500-1000tok/s prefill and 60tok/s output at Q6_K_XL with MTP on llama.cpp, 220k tokens context window (starts to get a bit dumb above 160k ish), no KV quantization
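To put those throughput figures in perspective, here is a rough back-of-the-envelope sketch of how long one request would take. The prefill and output rates are the ones quoted above; the 160k-token prompt and 1,000-token reply are illustrative assumptions, and the estimate ignores any prompt cache reuse between turns, which would shrink the prefill part considerably.

```python
def turn_latency_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough wall-clock time for one request: prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Rates quoted in the comment: 500-1000 tok/s prefill, 60 tok/s output.
# ASSUMPTION: a cold 160k-token prompt and a 1,000-token reply.
slow = turn_latency_seconds(160_000, 1_000, prefill_tps=500, decode_tps=60)
fast = turn_latency_seconds(160_000, 1_000, prefill_tps=1000, decode_tps=60)

print(f"cold 160k-token prompt: roughly {fast:.0f} to {slow:.0f} seconds")
```

So a cold, nearly full context is a matter of minutes, while incremental turns that only add a few thousand new tokens should come back in seconds.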
> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.
For this reason I wonder if local models are a potential business opportunity. Offer engineering teams a pre-built, pre-configured GPU rig they can run in a closet. No need to worry about all the things you mentioned, and clients can rest assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.
This is, as far as I know, the business model of companies like Mistral and Cohere.
On-premise (1960-2010) -> Cloud (2010-2026) -> On-premise (2026+)?
I think that's overstated, but the loss of trust companies have in the big AI players is pretty serious. Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like, enterprise clients really like to see good data security policies.
> Not a big deal if your app is for sharing cat videos, but if you're medical or wealth management or a government contractor or the like, enterprise clients really like to see good data security policies.
If this mattered to them, they wouldn't be running so much in the cloud or in proprietary software that they have no ability to air-gap.
If companies ever cared about this, Windows would not be dominant on the desktop.
There are a lot of government jobs I know of that are absolutely air-gapped. Your computer has basically no internet access, and everything is stored on-prem. Hedge funds also tend to be extremely locked down, from what I saw when I interviewed, with certain data sets either having strict encryption-in-transit or being stored in a quirky on-prem service. I can't imagine they're going to be dumping their data into Claude, etc.
As to why Windows is so dominant, I'm as clueless as you.
Agree. I also wonder how "zero" the retention really is with, e.g., Claude Enterprise ZDR, and what their data pipeline actually looks like.
I think the next step for anyone other than the overbloated US model vendors is to follow https://chatjimmy.ai/ with one of the Qwen models. If they can mass-produce something at relative cost, these would be awesome sidecars.
> (starts to get a bit dumb above 160k ish)
If open models can ever hold roughly 600k-token windows, I'll be really excited. I've found that having Claude read around 300-400k tokens of your codebase results in better outputs. I also have Claude read official docs instead of "guessing" as to how to do something.
I think we'll get there. Right now it works for me, because I'm naturally pretty verbose in my prompts, and know the codebase well, so I know what it needs to look at. Plus subagents for anything exploratory.
I think DeepSeek V4 Pro has 1M context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know.
Even then, if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at the full 1M. I guess that's why you don't see it too much. Anyone who could run Qwen 3.6 27B with 1M context would be better off running a much bigger model with a smaller context instead, in the same amount of VRAM.
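The memory cost of a long context can be estimated with the usual KV cache formula for standard attention layers: two tensors (keys and values) per layer, each `n_kv_heads * head_dim` wide, per token. A minimal sketch follows; note that the layer count, head count, and head size below are hypothetical placeholders, not the real Qwen 3.6 27B configuration, and models with hybrid or sparse attention will not follow this formula directly.

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size in GiB for layers that keep a full KV cache.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 means 16-bit cache entries; 1 approximates an 8-bit cache.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens
    return total_bytes / 2**30


# HYPOTHETICAL architecture numbers, chosen only for illustration.
ASSUMED_ARCH = dict(n_layers=16, n_kv_heads=4, head_dim=256)

for ctx in (220_000, 1_000_000):
    fp16 = kv_cache_gib(ctx, **ASSUMED_ARCH)
    int8 = kv_cache_gib(ctx, bytes_per_elem=1, **ASSUMED_ARCH)
    print(f"{ctx:>9,} tokens: {fp16:6.1f} GiB at 16-bit, {int8:6.1f} GiB at 8-bit")
```

Whatever the real numbers are, the cache grows linearly with context length, so going from 220k to 1M tokens multiplies it by about 4.5. That is the VRAM that could otherwise hold a larger model.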
In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation allowing Q8 to perform nearly as well as full 16-bit precision, plus some ideas around offloading KV cache to system RAM (but I'm skeptical)
DeepSeek V4 (both Flash and Pro) has very good scaling of context length wrt. RAM use, so this is not an inherent limit of LLMs in general.
With the YaRN and RoPE scaling arguments for llama.cpp you could run Qwen3.6-27B with 1M context… if you have enough memory to store it.
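As a rough sketch of what that involves: the YaRN scale factor is just the target context divided by the context the model was trained on. The native context length below is an assumption (check the model card), and the flag names are the RoPE/YaRN options as I understand them from llama.cpp's help output, so verify them against `llama-server --help` for your build.

```python
def yarn_scale_factor(target_ctx, native_ctx):
    """Context extension factor: how far past the trained context we stretch."""
    if target_ctx <= native_ctx:
        return 1.0  # no scaling needed within the native window
    return target_ctx / native_ctx


NATIVE_CTX = 262_144    # ASSUMPTION: the model's trained context length
TARGET_CTX = 1_048_576  # roughly 1M tokens

factor = yarn_scale_factor(TARGET_CTX, NATIVE_CTX)

# Flag names believed to match llama.cpp; treat them as unverified.
args = [
    "--ctx-size", str(TARGET_CTX),
    "--rope-scaling", "yarn",
    "--rope-scale", f"{factor:g}",
    "--yarn-orig-ctx", str(NATIVE_CTX),
]
print(" ".join(args))
```

Scaling only makes the positions addressable; the KV cache for the extra tokens still has to fit somewhere, and quality at the stretched lengths is not guaranteed.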
I don't really think you're making reasonable decisions at that size; but I suppose if you're not allowed to refactor it, maybe.
I think the way these models work rules out sane behavior as the context gets larger: each token introduces potential ambiguities between "USER" and "SYSTEM" messages, which leads to all the catastrophic behaviors.
Anyway, with the AMD395+ I'm finding ~100k is about the limit for both speed and context usefulness unless it's scoped tightly. With opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so the context doesn't get bloated with unnecessary functions, etc.
Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.
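For anyone unfamiliar with the idea of context pruning, here is a minimal sketch of one possible approach: replace the oldest tool outputs with a short stub until the conversation fits a token budget. This is not how the linked plugin works internally (I have not checked); the `Message` type, the 4-characters-per-token estimate, and the stub text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "system", "user", "assistant", or "tool"
    text: str


STUB = "[output pruned]"


def rough_tokens(text):
    # Crude heuristic (ASSUMPTION): about 4 characters per token.
    return max(1, len(text) // 4)


def prune_context(messages, budget, keep_recent=4):
    """Stub out the oldest tool outputs until the estimate fits the budget.

    The last `keep_recent` messages are never touched, and only "tool"
    messages are pruned, so instructions and user turns survive intact.
    """
    pruned = list(messages)
    total = sum(rough_tokens(m.text) for m in pruned)
    cutoff = max(0, len(pruned) - keep_recent)
    for i in range(cutoff):
        if total <= budget:
            break
        saved = rough_tokens(pruned[i].text) - rough_tokens(STUB)
        if pruned[i].role == "tool" and saved > 0:
            pruned[i] = Message("tool", STUB)
            total -= saved
    return pruned


history = [
    Message("system", "rules"),
    Message("user", "fix bug"),
    Message("tool", "x" * 4000),  # large, stale file dump
    Message("assistant", "done"),
    Message("user", "next"),
    Message("tool", "y" * 4000),  # recent output, kept
]
trimmed = prune_context(history, budget=1200, keep_recent=2)
print([len(m.text) for m in trimmed])
```

The design choice here is to drop stale tool output first, since file dumps and command logs tend to dominate the context while being the easiest thing to regenerate.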
Just this morning I tweaked my single 3090 setup too:
and that fits in 23GB. [edited for format]
Are you running NVLink? I have the same setup but no NVLink, and it feels like it's best to just split the 3090s and run separate models concurrently. But I also have no idea what I'm doing.
It depends on what you're comparing. If the same model fits in the combined VRAM but not in a single card's VRAM, then it won't be faster to run two instances of it. If you're comparing a 23 GB model running duplicated vs a 46 GB model running split, then yeah, the duplicated setup will likely be faster, just because there's no synchronization between cards.
AFAIUI, there'd be little advantage in having a higher speed inter-card connection, because the cards don't really talk to each other during inference. The loss of efficiency compared to a monolithic memory architecture comes from scheduling, not from data transfer.
Do you have any resources on the hardware necessary for running models, and on tweaks? I see you mention 2x 3090, and I wanted to do more research on what hardware is satisfactory for which models.
How long have you been using it?