wonder if AMD's new ai chip can run this with ease? I'm seriously consider buying it. GLM 5.2 is just shy of GPT 5.4 so I would welcome offloading any grunt work locally
I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
This should put more pressure on the frontier models to avoid sitting on any fancy stuff and lower token prices as a whole.
Nothing beats a local LLM disconnected from the cloud.
Are you talking about Medusa Halo? It's going to support up to 256GB unified memory (up from 128GB for Strix Halo and 192GB for Gorgon Halo). That might just be barely enough to run a 2-bit quant GLM-5.2. It will expand memory bus to 384-bits, vs. 256-bits for Strix Halo which will help with bandwidth (projected to be around 500 GB/sec). But don't expect Madusa Halo-based machines to appear until sometime in 2028.
The other way this could go is that Z.ai could decide to release a smaller model targeted towards coding. They've done that before (GLM-4.7-Flash had 30B params). It would be great if they decided to release something in the 80B-100B param range. Something that size would easily run in a current Strix Halo system.
Strix Halo only supports 96gb of video memory then it goes to 32gb to the host system.
No, I can go upto 112GB on my Strix Halo box running Linux. There are a few boot params to adjust, but it works.
yeah you are correct 2 bit quant won't be enough
guess we'll be paying $200/month for a while
> I am very excited for local LLMs I think we may have GPT 5.5-xhigh level of performance for under 2000 EUR
We are maybe 10 years off that.
RAM prices are going to continue to increase for the next 2 years at least.
Even putting that aside it's currently around 40-70,000 EUR to run this with a FP8 quantization (which you need to get close to maximum performance).
To actually get GPT 5.5-xhigh performance in the real world you need more headroom to support things like subagents (which will fill up your KV cache).
I like local models but realism is important. The sweet spot for the next 3 years will continue to be ~35B MoE models. They might match GPT 5.5-xhigh for chat-style problems but not for coding.
I wonder, if in the near future any acquisitions of some RAM producers with intent to just keep RAM prices up, will happen from the AI companies. It could seriously hurt their business, if companies will be able to host their AI in some time.
I think AI companies have enough things to spend capital on already.
[dead]
At full quantization GLM 5.2 may be close to GPT 5.4. But at Q2 or whatever one needs in order to run it on a pro-sumer device it will be worse.
Also I m not sure where you are getting the under 2k value. I bought a Framework desktop 128GB last year and my setup was around 2.7k. The same setup now sells for around 4.7k.
Even with upcoming AI Max+ PRO 495 we are capped with 192GB, so no...
"GLM 5.2 is just shy of GPT 5.4"... If your running the full model. As in have 750 (FP8) to 1.5TB(FP16) of memory available.
Do not mix the benchmark results of GLM 5.2 FP16/FP8 with FP4 or FP2.
* FP4 will mean a accuracy loss of about 3%. Not noticeable but more chance for mistakes.
* FP2 ... what is what most people are able to run at home, for a "reasonable" price. Your looking at over 17% loss in accuracy.
At that point, your running at less then claude-sonnet-4.6, as the issues compound with accuracy losses. And reasonable priced is still in the ~ $5000 range (192GB + GPU 32GB active/kv cache system).
For that price your using a Codex / Claude Pro subscription for the next 4+ years with better models (by default), let alone with a FP2 GLM 5.2 version. And your looking at < 10 fps. A MacStudio with 512GB will net you 18 a 20fps+ with FP4, but ... i mean, those used to be $10.000.
Unfortunately the local hardware cost is a major issue for running large models like that.
Edit: Its funny whenever the issue of cost and what you need to give up vs the subscription services, there are always people who downvote in bad faith.
The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money if we kept using AWS Bedrock (infosec reasons) for a couple years but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU, all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade
(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)
Yes but unfortunately a lot of the discussion that people participate in, are not done from a corporate point of view, but from a normal consumer level.
And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (take in account the privacy issues).
Everybody wants GPT 5.5/Opus 4.8 Max levels, on a budget that simply is not realistic. And GLM fit in that 4.8 medium/low level.
But then people do not want to be told that running a 750b model in Q2 or Q1 is just going to destroy the models accuracy. And that is still going to cost them 5k+ for that reduced model.
The whole local llm landscape from a consumer point of view, is just filled with odd people. lol.
Corporation really benefit from those models, because spending $90k on a server, is a deductible expense. And they are billed at token prices anyway from all the major providers. So its a even faster ROI on that hardware.
I am surprised that nobody figured out to make a business of selling leftover capacity from corporate llm installations, because there is easily 12h+ just wasted (unless its a large corp that has people in all timezones).
> GPUs are extremely underutilized if you launch just 1 generation stream
why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?
I have no intuition yet how this works under the hood.
Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. Not sure of the exact mechanism but e.g. llama.cpp's llama-server can do this (you tell it the number of slots to have when starting, then fire HTTP requests at it and it batches them together when it can).
Waiting for the hooman (or tool calls) won't help either, of course.
The mechanism is that generating tokens (the "decode" phase) in an LLM is limited by memory bandwidth for the weights, so computing multiple streams amortizes the bandwidth over streams as long as you can keep the contexts in RAM. This is most true for dense models and the always-on expert in MoE models, or when you have significantly more streams than experts for MoE models.
In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.
you are right that means GLM is still quite far off from truly competitive
i think your answer was perfect not sure why you are being downvoted
The AMD 395 supports up to 128GB unified RAM. So still not enough even at 1-bit quant unfortunately.
96gb vram is the max it supports.
That's the max you can statically allocate in the BIOS. It's best to leave that at the minimum (500 MB I think), and let the drivers dynamically allocate. You can use up to about 120 GB on Linux.
Under Linux it is allegedly 110GB, but I’m not sure.