I did the math, at least on a MacBook Pro, and for inference it's definitely not worth it.
- https://www.williamangel.net/blog/2026/05/17/offline-llm-ene... - Discussion: https://news.ycombinator.com/item?id=48168198
That's the case with self-hosting anything. It is the privacy that matters.
Not necessarily. I was spending ~$150/month on Vultr's Kubernetes hosting. I spent $5k building out a pretty awesome 1U server and put it in a colo that costs me $50/month. Next year I will break even financially, and everything after that is saving money. I am also getting so much more out of this server than I was getting on Vultr because I over-spec'd the machine. In addition to running more on my cluster, I spin up large virtual machines for development, experiments, and for offloading distributed builds. No shade to Vultr, but owning my hardware instead of renting was absolutely the way to go. Unfortunately, today the RAM alone would cost over $5k, so the math has changed.
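For anyone who wants to plug in their own numbers, here is a minimal sketch of the break-even arithmetic using the figures above ($5k upfront, $150/month before, $50/month colo after). The function name is made up for illustration, and it ignores power, repairs, and resale value.

```python
def break_even_months(upfront_cost, old_monthly, new_monthly):
    """Months until cumulative monthly savings cover the upfront hardware cost."""
    monthly_savings = old_monthly - new_monthly
    if monthly_savings <= 0:
        raise ValueError("no monthly savings, so the purchase never pays off")
    return upfront_cost / monthly_savings


# Figures from the comment above: $5k server, $150/month hosting vs. $50/month colo.
months = break_even_months(5000, 150, 50)
print(f"{months:.0f} months (~{months / 12:.1f} years)")
```

With those inputs the payback period works out to 50 months, a bit over four years.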
Privacy of what in this case?
One value of learning on my MacBook is that MPS is not as well supported as CUDA, which forces me to go down roads I would not otherwise have traveled.
That's more of a disadvantage. CUDA is an industry standard, MPS/MLX/Metal compute shaders are a novelty.
Except this math is 10x too high (unless accelerated depreciation is all of it): a million tokens at 28 tokens/sec and 75W and 20c/kWh should cost $0.15, not $1.50. (And less with MTP.)
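The electricity-only figure is easy to check. A minimal sketch, assuming the numbers quoted above (28 tokens/sec, 75W, 20c/kWh) and counting nothing but energy, so no hardware depreciation and no batching; the function name is just for illustration:

```python
def energy_cost_usd(tokens, tokens_per_sec, watts, usd_per_kwh):
    """Electricity cost of generating `tokens` at a steady rate and power draw."""
    seconds = tokens / tokens_per_sec
    kwh = watts * seconds / 3600 / 1000  # watt-seconds -> watt-hours -> kWh
    return kwh * usd_per_kwh


# One million tokens at 28 tok/s, 75 W, $0.20 per kWh.
cost = energy_cost_usd(1_000_000, 28, 75, 0.20)
print(f"${cost:.2f} per million tokens")
```

That is roughly 9.9 hours of generation and about 0.74 kWh, which rounds to $0.15.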
It's comparing laptops to dedicated GPUs in a server environment. The best comparison would be the Mac Studio but the current release is almost 2 years old at this point. We'll see what a likely M5 Ultra Mac Studio looks like, probably in Q3 this year.
But yes, for pure inference, the M5 Max MacBook Pros probably aren't there yet. They have other utility though, of course. And you can get 64GB and 128GB MBPs at a discount. Micro Center will currently let you buy a 64GB M5 Max MBP for under $4k, for example.
Why didn't you take into account batching, input tokens, different costs of electricity, and the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
> Why didn't you take into account [...] the fact that a laptop can still hold a decent % of its resale value, and is useful for many other tasks than running an LLM?
Because that wasn't what they claimed to research?
It's entirely fine if you enjoy local LLMs on your computer; there are people doing horribly inefficient inference on smartphones now. But for pure inference tasks, it's pretty obvious why M5s and Mac Studios aren't replacing TPUs and GPUs.
Who is going to buy a $4299 M5 Max MBP with 64GB of RAM just to run Gemma 4 31b? Firstly, you don't need 64GB for that model. Secondly, if you want a machine that sits in the corner and does nothing but LLM inference, you don't buy a MacBook Pro, you buy some GPUs, which are going to cost you a fraction of that (~$1k for ~64GB of VRAM is possible). The people buying Apple Silicon for inference generally aim for the Mac Studios with enormous amounts of RAM (128-512GB), to run very large models.
The idea is obviously to be running the LLM on your work laptop. As a developer I'd need a laptop with 24GB of RAM for work anyway, and 48GB, which is enough for a very good quant of Gemma, is just $400 extra.
24GB GPUs are $700-2500. Please show me the 64GB GPU for $1k.
Not a single new 64GB GPU, but multiple used GPUs.
They’ve significantly increased in price (so much for hardware depreciation…) but you can still get a modded 22GB 2080 Ti for $320, or an MI50 32GB for ~$450 each (used to be $150 a few months ago, alas), or an MI50 16GB for <$200, but you’d need to stack 4 of them.
There are also some more exotic configurations, but those are probably the simplest options. You won’t get the performance of an RTX Pro 6000 Blackwell of course, and the power consumption will be pretty high, so it’s only worth it if you have cheap electricity. But it is possible.
All the 22GB 2080 Ti are now $450-600.
> Firstly you don't need 64GB for that model.
You might need that to run it with a longer context; KV cache size is a known issue with that model series.
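To get a feel for why context length matters, here is a rough sketch of the usual back-of-the-envelope KV cache estimate for a plain transformer: two tensors (K and V) per layer, each of size kv_heads x head_dim per token. The layer count, head count, and head size below are hypothetical placeholders, not the real configuration of any Gemma model, and the estimate assumes every layer uses full attention with an unquantized fp16 cache (sliding-window layers or cache quantization would shrink it).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size: K and V for every layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len


# Hypothetical architecture, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 60, 8, 128

for context_len in (8_192, 32_768, 262_144):
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, context_len) / 2**30
    print(f"{context_len:>7} tokens -> {gib:.1f} GiB of KV cache")
```

The cache grows linearly with context, so under these assumptions a long context can easily need more memory than the weights themselves.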
> Gemma 4 31b? Firstly you don't need 64GB for that model.
You don't? It for sure doesn't run on my 32 GB M2 Max.
What quant? You should have no problem running it at Q4 with 256K context, or even Q5 or Q6, although maybe not at full context. I can run Q4 on a 4090 with just 24GB VRAM.
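A rough sketch of how quant level translates into weight memory, assuming a 31B-parameter model. The bits-per-weight values are ballpark assumptions for Q4/Q5/Q6-style quants, not exact figures for any specific file format, and the result covers weights only (no KV cache, no runtime overhead).

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 31e9  # "31b" from the thread

# Assumed average bits per weight; real quant formats vary.
for name, bits in (("Q4-ish", 4.5), ("Q5-ish", 5.5), ("Q6-ish", 6.5), ("fp16", 16)):
    print(f"{name:>6}: ~{weight_gb(N_PARAMS, bits):.1f} GB of weights")
```

Whatever is left after the weights on a given machine is what the KV cache and everything else has to fit into.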