I run Q4_K_XL. All it takes to run it at about 6 tk/sec is 512 GB of RAM and two 3090 GPUs with llama.cpp -cmoe. I also have crappy DDR4 at 2400 MHz; 3200 MHz would bring that speed up to about 9 tk/sec. I also have an OK 32-core Epyc CPU; a better 64-core would bring it up to about 11 tk/sec. I did a budget build before the crazy hardware costs and I regret it every day. Nevertheless, it's fantastic being able to run this model at home. It's great for planning, and for one-shot prompting once you have a plan or all the context you need. The entire build cost $2400 when it was put together. If you're willing to be resourceful, you can find ways to run these models at home. I often get the silly question of why, and suggestions about how much I could save using a cloud API, but the Fable drama has opened eyes to why it's good for us to be independent. Thanks team unsloth, Q4_K_XL is solid. If you are going to grab a quant, make sure to get the K_XL variant if it can fit.
I applaud all you tinkerers for pushing on the state of the home-brewed art here. Like crypto, AI is drowned out by hucksters, very few people talk about developing resilience. Or the researchers who will push on open source models in efforts to cram them onto an electric toothbrush or tamagotchi. Bravo to you all.
Running that at full load is at least 600 W, so in a day ~14 kWh. At $0.20 a kWh, that would be $2.80/day or $1k a year of op-ex in electricity.
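For anyone who wants to rerun that estimate with their own numbers, here is a rough sketch. The 600 W draw, the 24-hour duty cycle and the $0.20/kWh price are the figures from this comment; the function name is made up for illustration.

```python
def electricity_cost(watts, hours_per_day, usd_per_kwh):
    """Return (kWh per day, USD per day, USD per year) for a constant draw."""
    kwh_per_day = watts * hours_per_day / 1000
    usd_per_day = kwh_per_day * usd_per_kwh
    return kwh_per_day, usd_per_day, usd_per_day * 365

# Figures from the comment: 600 W, around the clock, $0.20 per kWh.
kwh, per_day, per_year = electricity_cost(600, 24, 0.20)
print(f"{kwh:.1f} kWh/day, ${per_day:.2f}/day, ${per_year:.0f}/year")
```

Unrounded this comes to about 14.4 kWh and $2.88 a day; the $2.80 figure comes from rounding down to 14 kWh first.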
Unless you really want privacy or the fuzzy feeling of owning your own, it’s cheaper, more convenient and has much faster tok/s if you pay a hyper scaler.
That said, I do like the direction we are heading in and look forward to seeing what host-your-own hardware we get in 2 years.
No one runs at full load all day locally. The only way to see that is if you're training. We are talking about inference. I limit my GPUs to 300 W each, and you can limit them down to 200 W. Since not everything is on the GPU and the bottleneck is between the CPU and system RAM, the GPUs don't even get to spike; I see 160-180 W for each GPU during inference. So redo your calculation. Figure about 6 hrs of daily inference, and we are down to roughly $125 a year. Thanks again for your speculation.
Not everyone lives in a place where electricity is $0.20 a kWh. For instance BC Hydro residential rates are $0.11 (CAD) for the first tier and $0.14 for the second tier of consumption in a month. At current exchange rate $0.14 CAD is $0.099 USD a kWh. Hydro Quebec is even cheaper.
At a theoretical 6 tok/s, 86400 seconds in a day, approx 500,000 tokens of GLM5.2 output for 2 bucks a day seems like a pretty good bargain to me. Of course not counting the one time cost of the hardware to run it. But I see people dropping $4000-5000 on all kinds of much less useful stuff.
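A quick sketch of that arithmetic, for anyone checking it. The 6 tok/s rate is from this comment; the 600 W draw is borrowed from the parent comment's full-load figure and the $0.14 CAD/kWh second-tier rate from above, so treat the pairing as an assumption. Function names are illustrative only.

```python
def tokens_per_day(tokens_per_second, hours_per_day=24):
    """Output tokens generated if the model decodes nonstop for the given hours."""
    return tokens_per_second * hours_per_day * 3600

def daily_energy_cost(watts, hours_per_day, price_per_kwh):
    """Electricity cost per day, in whatever currency the price is quoted in."""
    return watts * hours_per_day / 1000 * price_per_kwh

tokens = tokens_per_day(6)
# Assumption: 600 W full-load draw (from upthread) at $0.14 CAD per kWh.
cost_cad = daily_energy_cost(600, 24, 0.14)
print(f"{tokens:,} tokens/day for about ${cost_cad:.2f} CAD")
```

That gives 518,400 tokens for roughly two dollars CAD, which is where the "approx 500,000 tokens for 2 bucks" figure lands.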
Additionally, in a place where people use electric baseboard heating or electric in-floor radiant heating, or really any other heating-element-based system in winter that's less efficient than a heat pump, the additional electricity used by a computing load is basically "free", since you would be spending that same money otherwise to heat your house. If a computer with 512GB of RAM is dumping its waste heat into your room, it accomplishes a portion of the same thing as a baseboard.
Not to mention there is a whole other less measurable benefit of having a locally hosted model that can't be turned off or arbitrarily restricted by a service provider, and where all of your queries and context cache aren't subject to surveillance by any third party.
Unless the token estimates I get from using Claude are wayyy out, I burn through 5m+ tokens/day, and I'm not doing a lot of time. 500k tokens in a 24h period for $5k of hardware seems quite poor?
Be sure you compare input tokens to pre-fill rates and output tokens to generation rates.
Where I live prices are often higher than 20c/kWh, but let's take your example and halve it (10c/kWh) so it's ~$1.40/day or ~$500/year.
On Openrouter, the cheapest GLM 5.2 provider costs $3/MTok (at 44 tps). Assuming most use is output tokens, that's still the equivalent of 450k tokens/day, so we're in the same ballpark, but without the capex for two 3090s and the machine.
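To make the comparison concrete, a small sketch under the numbers quoted here: $1.40/day of electricity, $3 per million tokens and 44 tok/s on the API, against the 6 tok/s reported upthread for the local box. It treats every token as an output token, as this comment does; the helper names are hypothetical.

```python
def tokens_for_budget(usd, usd_per_million_tokens):
    """How many tokens a given spend buys at a flat per-million price."""
    return usd / usd_per_million_tokens * 1_000_000

def hours_to_generate(tokens, tokens_per_second):
    """Wall-clock hours needed to generate that many tokens."""
    return tokens / tokens_per_second / 3600

api_tokens = tokens_for_budget(1.40, 3.0)
print(f"{api_tokens:,.0f} tokens/day from the API for $1.40")
print(f"{hours_to_generate(api_tokens, 44):.1f} h at 44 tok/s")
print(f"{hours_to_generate(api_tokens, 6):.1f} h at 6 tok/s")
```

The budget buys a bit under 470k tokens, in line with the rough 450k above; the time columns show how differently the two setups deliver them.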
Self hosted only makes economic sense if your priority is being in control / avoiding surveillance.
That's true, there's a lot of places where power is considerably more expensive than $0.20 USD/kWh. But also the 600W figure assumes that it's fully loaded 24x7x365.
A system that draws 600W under max CPU usage on all cores and RAM, plus a few 3090-class GPUs, might draw only 90W or thereabouts when idle at 0.00 unix load.
If we say: (600 * 24 * 31)/1000 = 446kWh in a month at full load 24 hours a day
But it could be less, such as: (90 * 12 * 31)/1000 = 33.48 kWh of idle time in a month, and 223kWh of "full load" 600W time in a month, if it's at full load only 12 hours a day.
If you're the only user accessing it and you only "use" it 12 hours a day, that cumulative yearly dollar figure would be almost halved. Or even less if a person is using it in bursts and intermittently throughout an 8 hour workday.
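The same split can be written as a small function, in case anyone wants to try other duty cycles. The 600 W load and 90 W idle figures and the 31-day month are the ones used above; the function name is just for illustration.

```python
def monthly_kwh(load_watts, idle_watts, load_hours_per_day, days=31):
    """Return (kWh under load, kWh idle) for one month of mixed use."""
    idle_hours_per_day = 24 - load_hours_per_day
    load_kwh = load_watts * load_hours_per_day * days / 1000
    idle_kwh = idle_watts * idle_hours_per_day * days / 1000
    return load_kwh, idle_kwh

for hours in (24, 12, 8):
    load_kwh, idle_kwh = monthly_kwh(600, 90, hours)
    total = load_kwh + idle_kwh
    print(f"{hours:>2} h/day load: {load_kwh:.1f} + {idle_kwh:.2f} = {total:.1f} kWh")
```

At 12 hours a day the total is about 257 kWh against 446 kWh for round-the-clock load, so a bit more than half once idle draw is counted.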
The usage is irrelevant if we're interested in cost per token. If you use it half as much, you get half as many tokens at half the cost. It's still $5.56 in electricity per million output tokens either way (using $0.20/kWh, adjust accordingly if you have cheaper electricity). If you use the API, you also pay half as much if you use half as much.
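For reference, the $5.56 figure falls out of the following sketch, assuming the 600 W draw and 6 tok/s quoted upthread and ignoring idle power. Hours of use never appear in the formula, which is the point being made. The function name is made up for illustration.

```python
def usd_per_million_tokens(watts, tokens_per_second, usd_per_kwh):
    """Electricity cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = watts * seconds / 3600 / 1000
    return kwh * usd_per_kwh

for price in (0.20, 0.10):
    cost = usd_per_million_tokens(600, 6, price)
    print(f"${price:.2f}/kWh -> ${cost:.2f} per million tokens")
```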
> person is using it in bursts and intermittently throughout an 8 hour workday.
You can’t do that with 6 tps, though.
I think that's the biggest difference for most. If you can amortize the hardware costs, then 'burst usage' is cheaper at home to a degree, because you are paying a fixed monthly rate elsewise. Overall though, for most people it is likely cheaper to use the cloud than to run at home, but it really depends on what you want.
> because you are paying a fixed monthly rate elsewise
No, you would pay usage based rates with API, in this case. I have exactly one fixed monthly rate for the 6 AI models I have tokens available for.
> But also the 600W figure assumes that it's fully loaded 24x7x365.
It isn't 100% efficient. Even the best PSUs aren't.
Lots of people have solar. Green AI, imagine that!
if only there was a magical place where geothermal and hydroelectric is ubiquitous and the weather is cold enough that no one is going to be complaining about free heating.
The largest geothermal plant in the world is only 1.5GW, in the United States, which is over double all the plants combined in Iceland. The second largest is 1/3 that, in Mexico. [1]
There is no "ubiquitous" geothermal where there is also high power usage. Data centers have to go where the power is, not where it could be.
[1] https://en.wikipedia.org/wiki/List_of_geothermal_power_stati...
Related, it should surprise no-one that the tech giants are interested in nuclear [1], including small reactors [2], rather than waiting for the utility monopolies [3] to raise an arm and actually generate more power [4].
[1] https://www.cnbc.com/2025/03/12/amazon-google-and-meta-suppo...
[2] https://www.sciencenews.org/article/small-modular-nuclear-re...
[3] https://floodlightnews.org/fraud-and-corruption-on-rise-at-u...
[4] https://decarbonization.visualcapitalist.com/animated-70-yea...
To be fair, Vancouver is such a magical place in terms of electrical cost, but the cost of living and real estate are otherwise through the roof, with decrepit and nasty (would need $100k in renovations immediately if you're not treating it as a teardown) single family detached homes on the east side of the city selling for 3.2 million.
Yeah there's a reason our datacentres are in Kamloops, cheap housing and a big ass river right next to it. It even gets decently cold in the winter so you can save on cooling.
There's also tons of opportunity to build them out in former pulp mill towns on Vancouver Island that have big interconnects or dedicated generation.
You'd have to be an idiot to put a datacentre in Vancouver, or have fuck-off scale monopoly money, which is probably why Telus is doing it.
Shhh don't forget we have a water shortage. But it is nice to have electricity wrapped into my relatively cheap basement suite rent ;)
You aren't, perchance, from Iceland, are you?
We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.
I think the main reason not to run locally is to get the full models instead of quantized versions.
> We do want privacy, and we also want to own the hardware so the US can't just turn it off whenever it feels like it.
I agree and I prefer on-prem where possible. The Apple Mac Studios have been great for that although I don't have enough of them to run GLM-5.2 without heavy quantization. I'm also waiting for the Apple next product refresh which I hope will enable me to do more with less.
Meanwhile there are hosted privacy-conscious options out there. Two names to look at are Tinfoil[1] and Privatemode (from Edgeless Systems)[2].
Tinfoil[1] is, sadly, US-based. EU-sovereignty-option is on their long-term radar. But they do have GLM-5.2 today.
Privatemode[2] is a German company (Edgeless Systems) with EU-based servers. But sadly no GLM-5.2 today, it is on their mid-long term radar though.
Both Tinfoil and Privatemode operate on the same concept of the LLM operating in a secure enclave and you have end-to-end attestation and encryption.
Tinfoil have not been independently audited, it is somewhere on their long-term radar.
Privatemode have been thoroughly independently audited with documentation available on request.
Both of them are API-tokens-only. So if you're currently one of those people throwing $200 a month down the pan at Anthropic/OpenAI for a so-called-alleged 'unlimited' plan, then neither Tinfoil nor Privatemode will be the place for you.
[1] https://tinfoil.sh/ [2] https://www.privatemode.ai/
> Apple next product refresh
I have this feeling that it'll be very expensive and still scarce. Normally I wouldn't say this about Apple, because their pricing is part of their brand, but this time the demand (both by data-centers and prosumers) is the force majeure.
> because their pricing is part of their brand
I know people usually say that about Apple, but to be fair to them on this occasion they have not hiked up their prices yet because they are clearly at present still under some old deals that they did a good job negotiating.
However, of course, at some point Apple will run out of both inventory and old-pricing manufacturing capacity. Yes, I am fully expecting some sort of price-hike like has been seen everywhere else. I am not naïve.
When that time comes it will remain a financial calculation, Apple boxes on one side versus hosted-option-costs on another, in relation to my specific use-cases.
Ultimately I still blame the chip-hoarding hyperscalers though. :)
I guess you missed the recent news. The problem is that a cloud LLM might just silently sabotage your work by downgrading the model behind the output with no notice.
Or a cloud LLM might just refuse to sell to you because it doesn't like your passport.
So you're buying expensive hardware as insurance for the case that your cloud provider turns against you and you have to switch to another of the twenty offering the same model https://openrouter.ai/z-ai/glm-5.2 or in the worst case buy the same hardware later? How does that make sense?
It’s rationalization for what people want to do anyway.
Like buying a new car today and taking on gas, parking, etc, expenses in case the bus route you’re using goes away at some point in the future. It’s not an economic decision, it’s a desire to have the new car dressed up in what-ifs.
Yes, it is understandable that people who are subject to being kicked off the bus at random times through no fault of their own, or who sometimes find that the bus slows to 8 miles per hour and makes them late for work, or who are tired of arguing with the bus driver who refuses to take them to the liquor store, the casino, or the titty bar, may aspire to own a car, even a crappy one.
Any more tortured metaphors in store for us?
This is not really a problem for the open-weight models, you can always give your money to an inference provider in a different jurisdiction
Even on a macStudio w 512 gig memory?
So in my experience with two 7900 XTs and models that sit fully in VRAM, it's more like 400W; the GPUs spend a lot of time waiting for each other.
Depends on whether you've also gone for self-hosted electricity generation or not.
I have rooftop solar and I have been building credit with my electric utility even though the daily high temperature is well over 100F outside and a comfortable 75F inside. That includes running three AMD 12 thread 128GB systems with obsolete GPUs 24x7x365. I'm not a gamer, so 6 years ago I went low-end low-power GPUs. Boy am I dumb. Currently running the qwen3.6:27b, 35b, and gemma4:31b models just fine.
As soon as VRAM prices drop to sanity I'm going to load up and I could care less about the power draw.
Some parts of the future are absolutely great.
which hyper scaler would you suggest ?
how do you rent 2 3090s for $2.80/day?
AIUI the llama.cpp implementation for this model is still quite half-baked because it lacks support for the DSA sparse attention mechanism. This means the model runs with a different attention mechanism than the one it was trained with, which has been shown to lead to lower quality and performance.
Anyway, I think GLM 5.2 in many ways is not as interesting as DeepSeek V4 series, which uses an even more advanced attention mechanism and can save a lot of memory capacity for KV cache, especially at larger contexts. Which in turn opens up wide batching especially on consumer platforms. GLM doesn't have that, in some ways it feels broadly similar to Kimi 2.6 wrt. the underlying performance architecture. Both are a bit too heavy to run reasonably at full quality on ordinary hardware.
Particularly DeepSeek 4.1, which they appear to be A/B testing on the API and which also seems available on the free chat interface.
It also has an input image modality, which is a game changer. The cheap Sinofrontier models have generally been lacking in this regard.
Basically, Chinese competition is fierce - DeepSeek set the pricing tier, and the question for each lab now is how to justify charging a little more.
MiMo-2.5-Pro has gone with UltraSpeed, pumping out 1000t/s for a 3X price hike.
GLM has gone with 5.2, hitting Opus levels of reasoning at a fraction of the cost.
DeepSeek will probably keep their pricing model and just keep getting better and better.
Qwen-3.7 is the dark horse. Some rumours are Alibaba is simply making these models because they need them internally.
The real question is why this level of innovation and competition isn’t happening in America or Europe. In particular I see no reason Europe doesn’t have a lab competing on these terms.
Competing and innovating in the fast moving SOTA end of the llm space requires a ruthless disregard for copyright, IP, bureaucracies, formalities, risk assurances and other slowdowns. It requires a risk tolerant, quick and large flowing investment of capital. It requires a scoped focus that is pragmatic and sharp about key concerns, and efficiently dismissive of meaningless details.
Europe can provide none of this. They will never be at the frontier of AI tech, for the same reason they were never at the frontier of any tech.
I say this as a software engineer from Europe.
I’m not completely convinced that America and China are both lawless free for alls, and that that is what’s required for AI innovation.
Europe was never at the frontier of any tech? Huh what now?
A hyperbole born of frustration, I admit.
Qualify it to software, rather than all tech, if you will.
Not since the salad days of Nokia. Ancient history at this point.
6 tokens per second is not fit for interactive use. I find Gemma 4 (QAT 4-bit, MTP) to be tolerable at about 30 tokens per second on my old GPUs. Anything slower than 15 is annoying. I tried DS4 on my Strix halo (1-bit quantization of DeepSeek V4 Flash, the biggest model that can realistically run on 128GB, right now), and it tops out at something like 10 or 11 with a long time to first response, and that's quite painful to use. I'd definitely rather spend money to use the big models on cloud infrastructure.
And, the several thousand dollars it costs to run these things unusably slowly buys a lot of tokens on the cheap Chinese models.
"All it takes to run" might be fair if you paid $2400, but right now the total price is way closer to $10k (almost 5k for the RAM and 2k each for the GPUs). Today that is a lot of expensive hardware.
512gb 2400mhz ddr4 ram = $1600 not $5000. https://www.ebay.com/itm/188284985172 You can get creative and source 2-3 2080ti 22gb from China for about $250 a piece. You can either be resourceful and find a way or find a whole bunch of excuses.
> You can either be resourceful and find a way or find a whole bunch of excuses.
How about addressing this false dichotomy with the likelihood that someone who is new or interested in a tech isn't willing to drop thousands of dollars on used hardware for a whim or learning exercise.
LOL, sure this works if one has a time machine or a LOT of money to burn.
A 32-core Epyc (Epyc is required for faster memory access) + 32 GB VRAM + 512 GB RAM is stupid expensive nowadays, and in the best case it will just downgrade to "very" expensive at some point in the future.
This makes sense only if 1. one is paranoid about privacy, or 2. they have money to smoke, or 3. they need to work around cloud model restrictions, AND they have to do it routinely (because if not, a one-shot cloud bare metal setup is way cheaper, faster, and allows more powerful models, thanks to the VRAM on offer).
I did spend stupid money as well and yet, the system is 2x slower than cloud providers for comparable performance on vision tasks (I still have to test coding). Oh, and it's hot as hell.
6 tokens per second?
Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.
I have been putting up with it forever. We are spoiled by MixtureOfExperts. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high quality tokens. If I could get a Fable equivalent model, I'll gladly take 2tk/sec if that's what it took to run it locally.
But what is it doing for you that you couldn’t do yourself at that speed? I‘m really curious and on the fence of partly going local.
I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is, you don't have to run just that model; you can offload a lot of chores to smaller models.
Not a Local LLM user, but I regularly kick off meaty jobs in Claude Code then check on them 1-2hrs later.
In this case it would be 20-40 hours to accomplish the same amount of work when running locally.
Run one task, while you do another? Or while you sleep / eat / rave?
While my colleagues are running 6 parallel agents at 50-100t/s each, with an actual SOTA model? Don’t you think I‘d get fired after a few weeks of that?
I agree single digit tk/sec is painfully slow, but I also doubt anyone with these local/homelab setups is using them for work. More likely they fire off a job and check back later. That said, I've had terrible results one-shotting, so you'd need to design with a faster model or have extreme patience during the discovery/design phase.
Do you work at Facebook and happen to find yourself in a token burning competition with your colleagues?
Why would you use this when your company has access to actual SOTA? I don't get it.
Here's a thought experiment for you. Let's say you can run 1000 agents at 10,000 tokens a second. Do you think you are going to be more productive than someone running at 6tk/sec with the same model?
In case it's not clear, you will be generating 10,000,000 tokens a second. Good luck verifying it. Token generation is not the bottleneck for creative work. If you are doing predictable work and have a good workflow and a massive dataset to process, then token speed matters. If you are performing creative work like coding, it doesn't.
do you use caveman or similar?
I get a lot done with something that's also approximately 6 tokens/second, if you're willing to give it a well defined set of prompts and projects to work on, leave it for an hour or two, then come back and check what it's done. And often to remember to give it something of more consequence to do for at least 3-4 hours of wall clock runtime before heading to bed.
I have pretty much almost this exact setup with 2x3090s and with slightly faster DDR4 512GB and 64 core Epyc! [0] I've been enjoying it a lot. Can't wait to give this model a try.
Apart from running local models, I use this rig as my main remote development platform. All Claude Code sessions are running there in tmux now. And my fingers couldn't be happier not having to deal with a constantly hot laptop. Not to mention that Claude Code is such a battery hog.
[0] https://medium.com/@rathko/i-built-an-epyc-64-core-512gb-ram...
How can you combine CPU cores and multiple GPU? Are you running some layers in cpu, others in gpu #1, and others in gpu #2? What about the bandwidth and latency between them?
Or maybe the model itself only runs on the GPUs, and the CPU memory only stores the weights for experts not currently activated? If so, then what are the 32 or 64 CPU cores for?
I'm a big fan of fully utilizing one's hardware and it's kinda sad that it's not the norm to run things on either gpu, cpu or both, dynamically choosing at runtime, for everyday software
Pipeline parallelism. Instead of splitting layers by row/column, you split at the layer edges. So instead of having this huge bandwidth bottleneck, you only need to transfer about 4KB per token when changing devices on a model like Qwen 3 30B-A3B.
This is a good place to start reading about dual gpus.
https://github.com/noonghunna/club-3090/blob/master/docs/DUA...
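As a back-of-envelope check on the ~4KB-per-token figure for pipeline parallelism: what crosses a device boundary is one hidden-state vector per token. The sketch below assumes a hidden size of 2048 and 16-bit activations for that Qwen model; both values are assumptions for illustration, not something stated in this thread.

```python
def activation_bytes_per_token(hidden_size, bytes_per_value=2):
    """Bytes handed from one device to the next for a single token."""
    return hidden_size * bytes_per_value

def boundary_traffic_bytes_per_s(hidden_size, tokens_per_second, boundaries=1):
    """Total bytes per second crossing all pipeline boundaries during decode."""
    return activation_bytes_per_token(hidden_size) * tokens_per_second * boundaries

# Assumption: hidden size 2048, fp16/bf16 activations (2 bytes per value).
per_token = activation_bytes_per_token(2048)
print(per_token, "bytes per token per boundary")
# Two boundaries, e.g. GPU 1 -> GPU 2 -> CPU, at an illustrative 6 tok/s.
print(boundary_traffic_bytes_per_s(2048, 6, boundaries=2), "bytes/s in total")
```

Even with several boundaries this is tiny next to PCIe bandwidth, which is why splitting at layer edges avoids the interconnect bottleneck during token generation.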
But in this case he used a cpu too
Check out llama.cpp; the entire point of the project is to serve us mere mortals and the GPU poor.
Very cool. So it's not just about GPU VRAM, which I incorrectly thought. I thought you'd need 512 GB of GPU VRAM. I don't think it cost only $2400, though; 512GB of RAM would have been more expensive than that even back in the day. But not the mortgage-grade $200,000 which I estimated myself (which assumed running 100% in VRAM; probably overkill for a single user).
You can use system RAM with something like llama.cpp, which offloads to system RAM. Token generation is a function of memory bandwidth: the faster the bandwidth, the better. I'm on 8-channel DDR4 at 2400 MHz. If I had 12 channels, I would get 1.5x the speed at 2400 MHz. Of course DDR5 is much faster, so 12 channels at 4800 MHz would provide 3x the speed for token generation, or roughly 18 tk/sec. Prompt processing is all about compute, so the more CPU cores you have, the faster it can do PP.
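Those ratios can be sanity-checked with a rough sketch. It assumes a 64-bit (8-byte) data path per memory channel, theoretical peak bandwidth, and token generation scaling linearly with bandwidth; real systems land below peak, so read the output as an upper-bound estimate. The function names are illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

def scaled_tokens_per_second(base_tps, base_bandwidth, new_bandwidth):
    """Assumption: decode speed scales linearly with memory bandwidth."""
    return base_tps * new_bandwidth / base_bandwidth

base = peak_bandwidth_gbs(8, 2400)  # the 8-channel DDR4-2400 setup described above
for channels, rate in ((8, 2400), (12, 2400), (12, 4800)):
    bandwidth = peak_bandwidth_gbs(channels, rate)
    tps = scaled_tokens_per_second(6, base, bandwidth)
    print(f"{channels} ch @ {rate}: {bandwidth:.1f} GB/s, ~{tps:.1f} tok/s")
```

Under those assumptions 12 channels at the same speed gives 1.5x and 12 channels at 4800 gives 3x, matching the 18 tk/sec estimate.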
Well, it's about GPU VRAM if you want something competitive with cloud-hosted offerings at the performance levels showing in benchmarks. This is a heavy quant with quality degradation and significantly lower performance.
Cloud offerings are 80-200tk/sec versus single digit tk/sec.
That said, I'm also surprised it runs at all locally. I do think it'd be painfully slow for anything interactive so you're relying on another model for a comprehensive design or you're hoping a one-shot with somewhat degraded quality turns out correctly.
I see. So not quite usable apart from specific use cases. Maybe in a few years we'll see new hardware players and better prices.
I can work out max 90GB to the agents. Advise. :)
That's crazy good for $2400.