>64 GB

That's the rub. I have an M4 with 48 GB. I wonder if it is worth testing this out.

My past attempts (with Ollama and various LLMs) were too slow to use.

I have an M5 Max with 128 GB, and local models are toys compared to hosted ones. I've spent a lot of time and money trying to make it work even half as well.

It all depends on what you want to do, I guess.

If you're seeking the kind of hands-off Claude experience, obviously not. They are slow.

If you want to learn how these things work, train them locally, tinker, play with the code, grasp the fundamentals, or just out of sheer bloody-mindedness and principle refuse to tether the functioning of your application to a cloud API...

I have the same processor and RAM. The dense 30B-ish Gemma/Qwen models really don't break 10 TPS with or without MTP. MoEs in this range feel more usable if they are smart enough for your work. Probably would still use hosted versions of these over local unless. MoEs feel somewhere between Sonnet 3.5 and 3.7 to me. Dense feels between Sonnet 3.7 and 4 in basic coding or local agentic capabilities (not close to those in chat or world knowledge).
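A rough way to see why dense models in this range feel slow while MoEs feel usable: token generation is usually limited by how fast the weights can be read from memory, so a crude upper bound on tokens per second is memory bandwidth divided by the bytes of weights touched per token. The sketch below assumes that rule of thumb. The 400 GB/s bandwidth and the 3B active-parameter figure are placeholder assumptions, not measurements, and real throughput will be lower than this bound.

```python
def decode_tps_upper_bound(active_params_billions, bits_per_weight, bandwidth_gb_s):
    """Crude ceiling on decode speed when generation is memory-bandwidth bound.

    Every generated token has to read each active weight once, so the ceiling is
    bandwidth / bytes-of-active-weights. Ignores KV cache reads and compute time.
    """
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Placeholder hardware figure: substitute your machine's real memory bandwidth.
ASSUMED_BANDWIDTH_GB_S = 400

# Dense model: all ~30B parameters are read for every token.
dense = decode_tps_upper_bound(30, 4.5, ASSUMED_BANDWIDTH_GB_S)
# MoE model: only the active experts are read (3B active is an assumed example).
moe = decode_tps_upper_bound(3, 4.5, ASSUMED_BANDWIDTH_GB_S)

print(f"dense 30B ceiling: {dense:.0f} tok/s")
print(f"MoE with 3B active ceiling: {moe:.0f} tok/s")
```

The absolute numbers matter less than the ratio: with the same bandwidth and quantization, the ceiling scales inversely with the number of active parameters.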

From an economic point of view, there's almost no point in running these models locally. The only things they are good for would also be dirt cheap using the smaller/older models via some API. The hundreds or thousands extra you spend on hardware would easily fund a lot of that API usage. Unless you are using this stuff at scale, it's probably not going to be worth it.
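The break-even argument above can be put into a few lines of arithmetic. This is only a sketch: the hardware premium, monthly token volume, and API price below are made-up illustrative inputs, so plug in your own numbers.

```python
def breakeven_months(extra_hardware_cost, tokens_per_month_millions, api_price_per_million):
    """How many months of API usage the extra hardware spend would pay for."""
    monthly_api_cost = tokens_per_month_millions * api_price_per_million
    if monthly_api_cost <= 0:
        return float("inf")  # no API usage means the hardware never pays for itself
    return extra_hardware_cost / monthly_api_cost


# Illustrative inputs only (not real prices):
# $2000 extra on RAM, 50M tokens per month, $0.50 per million tokens.
months = breakeven_months(2000.0, 50.0, 0.50)
print(f"break-even after about {months:.0f} months")
```

At low volumes the break-even point lands years out, which is the point being made; it only moves closer as monthly usage grows.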

I've dabbled with Qwen 3.x and Gemma 4 models a bit. They are alright but not that impressive. And my Mac gets super hot if I use them for extended periods of time. It's just not very nice to use locally.

Some of these models will be a bit of a squeeze at Q4_0 I suspect; almost certainly they will be using CPU. Probably the 31B Gemma will be too much. Maybe not the Gemma-4 26B QAT.
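For a back-of-envelope check on what fits, here is a minimal sketch of the weight footprint at Q4_0. It assumes the llama.cpp Q4_0 layout (blocks of 32 four-bit weights plus one fp16 scale, so 18 bytes per 32 weights, or 4.5 bits per weight) applied uniformly to every tensor. Real files often keep some tensors at higher precision, and the KV cache and runtime overhead come on top, so treat the result as a lower bound.

```python
# Q4_0 block: 16 bytes of packed 4-bit weights + 2-byte fp16 scale per 32 weights.
Q4_0_BITS_PER_WEIGHT = 18 * 8 / 32  # = 4.5


def weight_gib(params_billions, bits_per_weight=Q4_0_BITS_PER_WEIGHT):
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Parameter counts taken from the model names mentioned in the thread.
for label, size_b in [("12B", 12), ("26B", 26), ("31B", 31)]:
    print(f"{label}: roughly {weight_gib(size_b):.1f} GiB of weights at Q4_0")
```

Compare the result against the memory your machine can actually hand to the GPU, which on unified-memory systems is less than the total installed RAM.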

But if you just want to play around rather than code, you really might find the Gemma 4 12B model worth mucking about with just so you've gone through the steps. Especially if you want to muck about with image analysis or audio transcription.

If you're writing PHP I think you could even find it good enough. I've been modestly surprised. You can do that basic fiddling with the Edge AI Gallery app, which can enable thinking and has a customisable system prompt and some agent support.

You could also try the 14B Deepseek R1.

Honestly even if it is not good enough, if you are anything like me, I think you'll find that going through this process is really quite educational — it has made a lot of things more concrete for me in a way that I have found reassuring and valuable.

i’m running m4 pro 48gb right now

omlx + gemma 12b 6 bit + pi

it’s feasible for sure

MoEs for speed (qwen 35b, cohere 30b, gemma 26b)

Dense for more methodical work (qwen 27b [reigning champ], gemma 31b, gemma 12b)

MoE i recommend 5bit+

Dense i think 4 bit is okay

Play with your context size, you don’t really need that much, have lazy loading for tools and mcps
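To see why context size is worth tuning, here is a sketch of how the KV cache grows with context length. It assumes plain full attention with grouped KV heads and an fp16 cache; the layer count, head count, and head dimension are hypothetical, so read the real values from your model's config. Models that use sliding-window layers or a quantized cache will need less than this.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GiB for standard full attention."""
    # Factor of 2: one key tensor and one value tensor stored per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens: {size:.1f} GiB")
```

The cache scales linearly with context, so trimming the window frees memory that can go toward a larger model or a higher-bit quantization.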

my pi extensions for anyone looking for a skinny quick setup, i use `--no-skills` right now too:

    "npm:pi-codex-goal",
    "npm:pi-simplify",
    "npm:pi-mcp-adapter",
    "git:github.com/elpapi42/pi-minimal-subagent",
    "npm:@wierdbytes/pi-statusline",
    "npm:@aliou/pi-guardrails",
    "npm:pi-lens",
    "npm:@juicesharp/rpiv-todo",
    "npm:pi-hashline-readmap",
    "npm:@mrclrchtr/supi-review",
    "npm:pi-cmux",
    "npm:@mrclrchtr/supi-context",
    "npm:pi-tool-search"

think of local models as "zero sugar" models and that's where we're at right now. I think it's crazy how good these models are compared to last year's frontier models

I'm running an M3 on an Air with just 16GB. I can still get useful results without an internet connection in "chat mode". It's a different experience than using Claude, for sure, but it's workable. I typically use the Qwen variants these days.

This might be useful when ‘coding in chat mode’: I have a few scripts that I run in a project directory that take a prompt from me and create a single long one-shot prompt that I can paste into a chat window, asking that any generated code be inside markdown code blocks for easier copy/pasting. Also, pardon the plug, but you can read my new tiny book free online that documents my experiences using agentic coding on my 16G Mac and my 32G Mac: https://leanpub.com/read/local-coding-agents
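For anyone wanting to try the one-shot prompt idea, a minimal sketch of such a script might look like the following. This is not the commenter's actual script: the function name, the file-suffix filter, and the per-file size cap are all assumptions made for illustration.

```python
from pathlib import Path


def build_one_shot_prompt(project_dir, task, suffixes=(".py",), max_chars_per_file=20_000):
    """Bundle a task description and the project's source files into one prompt string."""
    root = Path(project_dir)
    parts = [
        "You are helping with the project below.",
        "Put any generated code inside markdown code blocks.",
        "",
        f"Task: {task}",
        "",
    ]
    # Walk the tree in a stable order so repeated runs produce the same prompt.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            # Cap each file so one huge file cannot eat the whole context window.
            text = path.read_text(encoding="utf-8", errors="replace")[:max_chars_per_file]
            parts.append(f"--- {path.relative_to(root)} ---")
            parts.append(text)
            parts.append("")
    return "\n".join(parts)
```

Calling `build_one_shot_prompt(".", "add a --verbose flag")` from the project directory and printing the result gives a block of text that can be pasted straight into a chat window.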

Looks cool, I’ll check out the book. Your download links (PDF and EPUB) are down for me.

> NoSuchKey: The specified key does not exist…

People are using a 3090 (24 GB) to run models, and it is the most cost-effective way to run them. Yes, it is 2x faster, but memory-wise you can surely spend 24 GB on an LLM.

Also, there are smaller, still useful models that can run on 8 GB or less.

I've an M1 Pro with 32GB ram and it's running pretty well

M4 24GB here. You'll be fine. If you're anything like me, minor latency is acceptable to obtain (a) privacy, (b) reliability, (c) CI/CD/guardrails, (d) network independence, and (e) future-proofing vs. AIaaS. https://omlx.ai/ gives you intelligent, local-hardware-based model download recommendations. That said, it probably depends heavily on your workload, process and polish expectations. See also https://news.ycombinator.com/item?id=48089091

what are you using on yours? I've got a M4 Pro 24GB also. tried the open source gpt one. it's alright but I found it can get stuck at times. maybe just my config in LM Studio.

pi + Qwen3-4B-Instruct-2507 / Qwen3.6-35B-A3B-4bit / Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-4.5bit-msq depending how seat-of-pants I want to fly on memory.