Gotta say, I've lost all interest in cloud-based AI products. Too many cool features and workflows that I was once excited about that I can't or don't use anymore for a variety of reasons (price hikes, subjectively nerfed, disappeared altogether, replaced,...) for me to even remember. It's tiring.

I've set up a small rig, mostly settled on Qwen3.6 and I'm slowly adding features myself. It probably can't compete with Claude. I don't even know, I've stopped checking. It's providing a ton of value to me as is, and it only keeps getting better. All it takes is to realize that it doesn't actually matter if the grass is (maybe even objectively) greener somewhere else. Feels so good to know that it won't change under my feet. I've got this amazing, highly extensible tool, and it's mine.

I'm really happy this is one of the top comments here, I am fully local as well.

Just wanted to leave a note for folks who might not have the memory to run a big 32gb model - I just found out there are some pruned models that have really good performance and If I had a smaller machine I might try this pruned unsloth Q4 quant of GLM 4.7 flash that sits at 14gb: https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GG...

I usually use LM Studio for this type of thing but unsloth has their own studio type app that might be even better suited for these quants.

I used GLM 4.7 flash as my main model for months and it was an incredibly tenacious model and very very fast - I think on restricted hardware, this could be a great choice.

Qwen3.6-35B-A3B-UD-Q4_K_M runs at about 11 tokens/second on my poor old 1060. Absolutely nuts how far we've come

I tried running any model on my 1070 and it instantly crashes my old tower, probably time to get off windows and run linux on it.

Understated how much of a boon for Linux that AI development has been.

There isn’t any benefit to running a windows machine.

Au contraire, I run models on WSL and my desktop reliably wakes up from sleep. Best of both worlds.

Sounds like a hardware issue, though NVIDIA driver issues can't be ruled out, they're much rarer these days

Mind sharing your llama.cpp settings for that?

  .\llama-server.exe -m ..\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 999 --n-cpu-moe 41 -c 262144 --port 8081 --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3 --no-mmap --mlock --host 0.0.0.0 -t 8 -tb 8 -np 1
Using this llama.cpp fork https://github.com/TheTom/llama-cpp-turboquant and mostly copying from this video https://www.youtube.com/watch?v=8F_5pdcD3HY

Haven't had much time to test it other than asking a few questions & changing some HTML in cline so it might be thick as a brick for all I know, but still worth trying

I just tested it with some risc-v code and it wrote down a "mov" instruction several times.. yeah something needs tuning maybe

I often feel like we're nowadays mostly pushing AI developments in the ways of finetuning differences. Like how new editions of Claude are tuned for agentic coding which might even be detrimental if you're using it for non-agentic coding. Or how Fable 5 in fact do look great but at a huge cost for inference and a high likelihood of post-launch nerfs or limit/price revisions. How Gemini 3.5 has more liberal limits but on the other hand underperforms a bit.

It's like we're mostly treading mud at this point. New editions are released, a version number increases, but I have to wonder if all steps are forward or they're more just tuned differently with similar actual perf per dollar as when this year began.

Most in fact seem to be happening to me with small models. Like your Qwen. Or Gemma 4 31B which is kinda magic especially when considering multilingual abilities. So yes, in that sense I can see "development" probably as we refine data sets and training methods but I see it less on the big hulking beasts with daily limits (unless you turn it up to 11 like Fable).

Edit: As I posted this, I saw a "before and after" comparison for Fable and the reintroduced version is seeing a catastrophic drop in BridgeBench performance as they're still mucking with the model. Go figure... https://x.com/Hesamation/status/2072692225100612032

Same here, been happy throwing Qwen3.6 on my old MBP - no it's not as fast as Claude which I use at work, but it works well enough locally and I don't have to worry about credits or shit like the rug getting pulled under me in terms of capabilities.

This sounds very appealing. What size Mac mini would I need for that?

Personally, I would always max out the RAM you can fit into your budget. You might get lower bandwidth (= slower generation) than you do on a Mac if you choose a Strix Halo or DGX Spark, but there are always new tweaks being discovered to speed things up. That being said, with 32GB you should be able to fit an ok quant of 35B-A3B or 27B with some context, with 64GB you should be golden.

i have issues on a m5/64g with 35b-a3b (mlx) it eventually hits a memory cap around 52gb... but i'm pretty happy with `Qwen3.6-27B-Claude-Opus-Reasoning-Distilled-mlx-8Bit`

I'm sure there will be a fix for it, but it illustrates an important broader point I should probably have made above: if you opt for local AI today, expect to run into some issues. Expect to learn a bit about the tools you're using, the not-so-fun way. I'm not recommending it to non-technical friends (yet).

A PC with an nvidia card with 16gb vram works just fine for Qwen MoE models, and these have worked great as a daily driver for me.

A 4-bit quantization of either Qwen 3.6 27b or Gemma 4 31b will run on a 32GB Mac with a decent-sized, but not full-sized, context. 64GB gets you the full ~256k context and you don't need to quantize your KV cache (though 8-bit quantization of KV may be worth it for performance). The 4-bit QAT version of Gemma 4 has practically identical performance to the full size version or the 8-bit version in most benchmarks and my tests, so there's no reason to run anything else. The 4-bit Qwen is a little bit lossy, as it hasn't gotten the QAT treatment, but not catastrophically lossy. A 6-bit dynamic quantization would be better for that model, but it's ~25GB on disk, and you'll need more than 32GB to run it with a big context.

I wrote up how I run local LLMs, with numbers and a focus on running Qwen 3.6 and Gemma 4. I prefer Gemma 4 31b, even though the general consensus is that Qwen 3.6 is better for code, and it is better on most coding focused benchmarks...it doesn't seem to be for my use cases, Gemma feels smarter. And, with QAT, you get more smarts in less memory, so it's fast and runs on more hardware.

https://swelljoe.com/post/how-i-run-local-llms/

Currently, the sweet spot for self-hosted models is either Qwen 3.6 or Gemma 4, and those top out at 31B (Gemma) and 35B (for Qwen, but you want the dense Qwen 3.6 27B if you can run it as reasonable speed...the dense models are much smarter), so for now, a system with 64GB or 128GB is going to be running the same models. Going to a bigger model doesn't get you better performance because there aren't any better models that are a little bigger. I wish there was a ~70B or even ~120B MoE in the Qwen 3.6 or Gemma 4 families, as I've got a Strix Halo running a model that leaves a lot of memory on the table (and it's not very fast, to boot...an MoE would be faster, and hopefully smarter if it's a much bigger model, like double or triple sized).

In short, right now, 64GB is all you need for the best models you can self-host on anything short of five-figure machines, but, I wouldn't buy any hardware right now, if you can wait a while. Tokens from DeepSeek are so cheap, you can wait out the memory shortage and get access to models you could never host locally. And, OpenRouter always has free models in preview or just because that you can use lightly, as they're rate-limited (but your self-hosted models are going to be rate-limited, too, because a Mac Mini can't run models very fast). Google AI Studio has the Gemma 4 models for free too, also rate/usage limited.

Good summary blog: https://maloyan.xyz/blog/running-qwen-locally-mac-mini-m4

> That's not hypothetical — it's a real measurement on the base model Mac Mini.

Hmmm

I am curious if you implicitly assumed they are Macs or if that's what you are looking for specifically?

I assumed the 27B dense model would be preferable to a MoE model, and that it wouldn’t fit into a consumer graphics card, which leaves the Macs.

Then I assumed for cost and battery/heat reasons that a Mini would be better than a laptop.

The current dense models from Gemma 4 or Qwen 3.6 families will run well on a consumer GPU with 32GB in a 4-bit quantization (which is a little lossy for Qwen 3.6, not so much for Gemma 4, as it has a QAT 4-bit version). Even an Intel ARC B70 will work, though it's worth spending a little more for a the AMD Radeon AI Pro 9700, as it'll be like 40% faster, I think. A dedicated GPU will be faster and cheaper than a Mac Mini. But, nothing is a good deal right now, everything is overpriced (except DeepSeek tokens, which cost pennies to run a model that's better than anything you could self-host...DeepSeek V4 Flash, and even Pro, are absurdly cheap, made even cheaper by their bonkers cheap cached token pricing and uniquely effective caching).

The reason why I was curious is that I am running my stuff on a Strix Halo and I get the feeling that this class of devices ( gmktek, minisforum, lenovo, etc. ) seem to becoming a pretty good alternative

Unified memory feels like the future of consumer hardware, agreed! Do check out r/StrixHalo

Agreed, it was a bit of a pain to get running on my Ubuntu machine because I had old amdgpu-dkms-firmware packages installed without realizing it. But now that it's running it's amazing how well it works

Sounds like you got it sorted, but more generally this may be interesting: https://github.com/kyuz0/amd-strix-halo-toolboxes

Strix Halo is better performance than a Mac Mini, but not as good as a Mac Studio. But the 128GB unified memory is awesome for larger models.

dense models are (more) compute heavy, so are generally worse to run on mac. mac tends to be better for (larger) MoE models.

27B dense can fit on a consumer graphics card. Even without getting into various "intrusive" ways to shrink the size of a model (e.g. REAP), something like a NVFP4 quant of Qwen3.6 27b

https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4

should fit within ~22GB of VRAM. So easily on a 5090. It would also fit on a 3090/4090, but iirc they don't have NVFP4 natively, so you would want a different quant for them.

you can see /r/LocalLLama for some discussions. See this (random) post about Qwen3.6-27B on a 3090 at ~100 tok/s

https://www.reddit.com/r/LocalLLaMA/comments/1ujo46r/qwen_36...

Note that it is possible you could still do this stuff with a mac, as there are ways of hooking up a eGPU to macs and using it for inference. My understanding is they're all fairly hacky though, so it would likely be preferrable to just get a 3090 (or a non-nvidia option, e.g. an AMD r9700 pro has ~32GB of VRAM for much cheaper than a 5090.

https://www.reddit.com/r/LocalLLaMA/comments/1u50hnm/qwen_27...

that seems considerably slower though (~30 tok/s). I don't know if that's an outlier/misconfigured setup or what. In general there will be much better resources for local setups using 3090s, as they're quite popular. Note that 3090s (but not 4090s nor 5090s) have NVLink, so you can network the cards fairly effectively. For this reason 2x 3090 setups are fairly popular as well. I've heard that club 3090 makes that relatively straightforward

https://github.com/noonghunna/club-3090

but don't have experience myself.

People want to make it seem like you need to always use the latest and greatest frontier models to be taken seriously as a developer.

You really don’t need them. After a certain point, bigger models give diminishing returns. If you can get 80% of the productivity gain with a free local model, use the local model. It will still be way faster than doing everything by hand, but you also don’t have to pay for tokens to a cloud provider and the tools won’t be ripped away from you on a whim.

This is the new attitude enlightened people should adopt. Reject the arms race.

The biggest appeal of the frontier models is for those trying to get autonomous agentic systems running that do real work with minimal human input. I went down a rabbit hole trying that with frontier models, and after a lot of initial promise it ended up actually slowing me down.

We've all been through that no? In the beginning you can do a ton of stuff without reading code. But the LLMs miss all the good abstractions, they just push and push unmaintainable code until at some point you start having more bugs and then you NEED that LLM to fix the codebase you don't understand anymore.

There are guardrails you can and must add to protect your team if you take the vibe approach: a good type system, a good database with clearly written business model and a good data model to drive your business. Make it loud and clear when something breaks with your tooling.

But... I'd definitely not vibe everything after a certain point. Reading and fixing code is also a lot of fun.

They're insanely good for prototypes though. To be able to actually see something working before deciding whether it's worth investing the time to build it for real is invaluable.

Made an account to semi-disagree with you, haha!

I have to advocate for the vibe-coded mess-colony.

There are applications where it either works or it doesn't, and it's simultaneously obvious whether it does. Think stock price prediction software. I've killed time in the evenings verbally chatting with agents about that specifically, and what emerged worked! It didn't work well, but it clearly outperformed randomness, and I was able to verify that myself easily.

I didn't look at a line of code, but I had an absolute blast.

You couldn't have possibly verified that. Stock prediction based on what? What's your sample size over what period of time? Using what indicators? how far is your lookback?

This was a toy that I made.

Are you familiar with the concept of a Markov Chain? (If not, it's a simple tool that technically works better than randomly guessing for predicting stock movement.) I designed a very intense neural network meta-architecture, applied it, and the results were the same as if I'd used a basic Bayes model or Markov Chain. Which is a little humorous; I very much used a bulldozer to sweep the garage.

I used close minus open to determine up vs. down movment. Can't remember the lookback, but was predicting the immediate next day. Over the entire US market, a basic Markov-based model can predict the next day 52.5% of the time or something like that. (Given 1000+ stocks, you guess which direction all will go, 52.5% will be correct guesses.)

For what it's worth, I don't really know the details of the statistical tools. I do have a good grasp of train/test/validate sets, so I know what my results meant.

> People want to make it seem like you need to always use the latest and greatest frontier models to be taken seriously as a developer.

Except you kinda do. Try getting a job today without mentioning Claude experience. In another year it'll probably be something else. Saying you like to use Copilot today makes one seem elderly.

Not saying you need frontier models on a technical basis, but for career PR you probably do.

I never got into any of the AI models because it was clear local first was going to be more valueable, if they were to replace coding tasks.

I tried out a few models and ended up going with either Qwen3-Coder-Next (no think, just do) and Qwen3.6-35B (thinking, w/llamacpp token budget). Created a customized prompt that works fairly well to around ~60k tokens and then is a toss up on whether it's poisoned itself or I've directly steered it into the wrong. When it's clear that's happened, if it's important to continue, ask it to write a doc then start fresh.

I don't kno whow any one cold have witnessed the last 2 decades of American VC funded tech startups and tell themselves, "you know, this will be a reliable technolgy with no hidden problems".

Even a sober technical evaluation is just two steps:

1. You're proposing to build a app on a non-deterministic model.

2. That model is hosted behind a non-deterministic system (model alignment, model guardrails, system context subterfuge, cost/token pricing)

---

So you want to build your app and you think you're going to kep up with both #1 and #2?

Cool! Anything you want to share? I haven't looked much into my system prompt yet, do you have any tips?

We live in a non-deterministic world. Anything "deterministic" in it is a castle built on quicksand.

LLMs are, as far as the nastiness of the Real World goes, really fucking benign. Future models outperform past models, both in open weight land and at the big frontier labs. Performance per $ only ever goes up. That's just nice.

> We live in a non-deterministic world. Anything "deterministic" in it is a castle built on quicksand.

Except the Enterprise, and a lot of what people want compute for, is built on deterministic systems or processes. I'm not saying the non-deterministic nature of LLMs isn't useful. However I've worked with a lot of organizations on SOAR projects, for example. When you can weave the deterministic and non-deterministic together you get a relatively efficient system. A workflow that will stay on the rails and will come to a conclusion as expected. And the "as expected" part is critical in these types of systems. The reality of, using SOAR as an example, is also that most enterprise would be much better served by fast SLMs. Parse an email and validate if it's SPAM / Phishing or read a chunk of firewall logs and look for outliers / indications for escalation - those things can get messy in a deterministic system because of potentially unstructured data.

I don't believe it's either / or. And I believe that LLMs just aren't efficient, fast or reliable in the sense that deterministic are. It seems, at least to me, a better together story.

I think it might be built on something more than deterministic systems. Some property that is a subset of deterministic, so all your argument still apply, but merely being deterministic is not good enough.

LLMs are what made me start considering this. Imagine a company using an LLM that was fully deterministic. All RNG was either removed or seeded in such a way that the same input (so many the seed counts as part of the input) gave the exact same output. Fully deterministic.

But such an LLM, with a slight drift in input, could still produce very different outputs. This isn't being non-deterministic, but more than the change in outputs does not naturally follow from the input. I'm thinking like how 2 double pendulums can (but not always do) greatly diverge given a very small change in their input.

So in light of that I've begun to call this new property non-chaotic. So Enterprise depends on non-chaotic systems, which are a subset of deterministic systems, and then wrangling the chaotic elements they cannot remove as much as possible.

The follow question I now have is if all LLMs are inherently chaotic, or if it is possible to have a non-chaotic LLM.

YES, but you seem to not understand that having two non-deterministic layers is incompatible. #1 is fine: it has random issue and you build around those random issues; those issues don't change unless you change them.

#2 is not fine; that non-determinism you do not control, have no insight into, etc.

I'm saying sure, give me #1 if it means I can build a harness around it and smooth over the edges. But I'm not taking #1 and #2. There's zero reasonable way to manae two non-deterministic systems.

Qwen is the Alibaba distilled Anthropic Claude model

So piracy on an by piracy trained ai model..

Piracy? Lol.

Alibaba didn't steal Opus weights, they used opus output to train their model.

If this is piracy, then so is reverse engineering efforts powering a bunch of Linux drivers.

If that's piracy, I'm going to the library and arresting everyone there!

Also, yeah, they already stole their copyrighted works, so a thief from a thief is still...theives?

Well, Anthropic got paid for it, unlike the sources that they used...

I'm not sure what you're trying to say. Is that a good or a bad thing? Model distillation is presumably part of the reason why Qwen is so good, yes. As a consumer, that's a good thing I would say. It's a natural counterbalance to the monopolistic tendencies of other tech segments.

If you have ethical concerns, model distillation feels like an arbitrary line to draw. Why is the first type of piracy ok, the second not? You should restrict yourself to ethical open source models. Which is btw where I genuinely hope the future of local models is going to lie. Open weights is not enough, we need fully open source models to be sustainable. Even for simple things like updating the knowledge cutoff. How we are going to distribute the training effort will be an interesting problem where I don't see an obvious solution yet. Maybe the blockchain/federated learning people can suggest something. Or university consortia, or some public sector solutions. Or something really boring - I for one would absolutely be willing to pay for DRM-free weights of an open source model (even if I could pirate them for free).

Are you saying 2 wrongs make a right

I'm saying, either you have a problem with the copyright issues related to AI training or you don't. If you do, neither Qwen nor Claude are acceptable, if not then both are. They have similar moral standing to me.

Btw, ethically sourced, open source LLMs exist! Check out eg Olmo by Allen AI: https://allenai.org/olmo

Same here, I’ve removed my credit card from Copilot and won’t be renewing

What features/workflows have you added?

Web search, MTP (speeds up generation), uncensored models. Lots more things on my bucket list (eg various things related to image generation).

Not gonna lie, if you're coming from ChatGPT/Claude Code, you'll mostly be adding back features you've taken for granted, or solving problems you wouldn't have had. But sometimes you do get some extra utility, like uncensored models, which have become my go-to. Not because I'm doing anything saucy, but I hated how I'd become trained to pre-emptivly self-censor my prompts. The guardrails in open weights models are no less strong than in proprietary ones, subjectively even a bit stronger in Qwen. But luckily there's an entire sub-discipline of model ablation. Another advantage would be better control over image generation (although I can't attest to that, yet).

[deleted]