So close! My machine with 192GB RAM + RTX 3090 24GB can almost run this. It says it needs 24GB of VRAM and 256GB of RAM for MoE offloading.
https://unsloth.ai/docs/models/glm-5.2#usage-guide
In a prior thread, someone said it would take $500k in hardware:
$500k is a vast overestimation. For massive concurrency at FP8 or even BF16 maybe.
NVFP4 at reasonable speeds (~120 tok/s) and concurrency is possible for around $80-90k at today's prices, maybe even less. That buys you 6 RTX PRO 6000 Blackwells, a decent CPU and motherboard, and a power supply. That's 576GB of VRAM.
You could do it for under $50k if you're OK with 40 tok/s decode, ~1200 tok/s prefill.
Yes, a single GB300 workstation also does it, probably even more than 120tok/s.
Official price 85k...
Actual price $100k and everything is very closed and proprietary. Oddly this MSI system provides "only" 252G vram and 500G ram. I would have expected more vram for this price. Also why 252 instead of 256? https://www.centralcomputer.com/msi-xpertstation-ws300-ai-wo...
How fast will the hardware become outdated? Are there big improvements expected in the next 3 years?
M5 Ultra will ship before end of year, likely. Though with current RAM shortage, likely max spec will be 256GB and in short supply.
In late 2027 or early 2028, Nvidia will release Vera Rubin DGX Spark, likely with double or better the performance of current Blackwell, though unclear if memory capacity will go up much from current 128GB. Two to four of those will run models like this decently.
In 2028 we should expect Vera Rubin RTX discrete lineup, including the replacement to the RTX PRO 6000. Likely memory spec will be minimum 128GB. Good chance of up to 200GB. Two to four of those will run NVFP4 models in this class very well.
It might be the M6 Ultra. I think the real reason they stopped selling top-tier units was to avoid mid-generation price hikes and to build demand for the more expensive next-gen systems, which I assume will come with 512GB (maybe 1TB) of RAM and a massive markup to match.
I hope all this speculation comes true. Right now this ram crunch is ridiculous.
I feel like the models are good enough for a decade of future work. Once you have a working setup, you can keep using it to do work at the same level. Better stuff will come along and may make that type of work obsolete, but as long as you can do useful things with it, it won't be worth less.
I think there is a gap right now for running large models such as GLM 5.2 in Q4 or Q8. My hope is on Intel Crescent Island 480GB cards. Let's see how expensive they'll be.
480GB? Probably like $100k each? :D
The P40 was released in 2016 and is still selling like hotcakes!
You can get a 1TB of HBM2 vram for like 10k, https://www.ebay.com/itm/177571378959
The problem is the backplane: I have not managed to find a single baseboard, and getting a random baseboard to work with random modules is probably a crapshoot.
With 2-bit you wouldn't get good results. The ideal range for coding is at least Q8.
According to this very article, 4-bit dynamic is essentially lossless
Watch out. Those claims are often made based on KL-divergence over some arbitrary corpus, not performance in the real world or benchmarks.
I’ve found that I need to go a couple steps past whatever quantizations are good enough in the KL-divergence testing to get good performance in real tasks with long context. So when Q4 is claimed to be lossless I end up with Q5 or Q6 for actual long-context tasks.
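For anyone unsure what the KL-divergence number in these quant comparisons actually measures, here is a minimal sketch. It assumes you already have next-token logits from the full-precision and the quantized model at the same positions; the toy logits and function names below are made up for illustration and are not taken from any quantization tool.

```python
import math

def softmax(logits):
    # Convert raw logits to a probability distribution (numerically stable).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how much the quantized distribution q diverges from reference p.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_kl(reference_logits, quantized_logits):
    # Average the per-position KL over every token position in the test corpus.
    pairs = list(zip(reference_logits, quantized_logits))
    return sum(kl_divergence(softmax(r), softmax(q)) for r, q in pairs) / len(pairs)

# Hypothetical logits for two token positions over a three-token vocabulary.
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quantized = [[1.9, 1.1, 0.1], [0.4, 2.4, 0.6]]
print(mean_kl(reference, quantized))
```

The result is an average over whatever positions you feed in, so a low score on a short, generic corpus tells you little about how errors build up across a long-context task.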
Crossing my fingers that this boom jumpstarts '90s-like improvements in computing hardware.
I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
Most of the money and energy went to mobile for the last fifteen years.
Affordable local inference might be the gravy train the server, desktop, and laptop manufacturers need to get back in gear.
Definitely the stagnation was due to a lack of use cases, but this isn't a bad thing. We don't need most of the hardware advancement we got.
Business hardware got beefier because businesses demanded more data (or more specifically: the industry told businesses they needed more data), with no idea of what to actually do with it once they got it. To get all that data, bandwidth needed to be increased, with more iops to read/write it, more storage to keep it, and more memory and cpu to process it. But 99% of the data is junk. Companies have "data lakes" so big they need to come up with excuses to use the data, or risk somebody pointing out that they're spending a fortune hoarding bits.
Consumer hardware hasn't had a new use case since like 2012. Faster wifi for broadband & local file transfers, and higher-resolution video, are the only reasons one needed newer hardware. We actually got a resolution so high it makes no perceivable difference. And yeah we got faster CPUs and memory, but as soon as we did it got all eaten up by the most inefficient, wasteful software conceivable. Same use cases as 13 years ago, just more expensive, harder to use, and buggier. We should've gotten a new CPU architecture that was faster and more energy efficient. Finally it was delivered, but with a moat around the golden Apple.
Here we are two and a half decades into the Internet era, and my damn bluetooth earbuds and webcam microphone don't work half the time that I open a video conferencing app. Hardware can stay exactly like it is for the next few decades and I'd be happy. I just want software that works, and doesn't get continuously slower, forcing me to buy bigger hardware; or more draconian, locking me out of being able to use it how I want.
The natural progression when performance is enough would be lower prices. We were starting to see that, but not anymore. I wonder if somebody is afraid of a future where generally useful computation is cheap.
>I feel like part of the reason for the relative stagnation in hardware over the last twenty years was simply the lack of use cases to justify hardware refreshes by businesses.
No, we're running into the limits of Moore's law, and it's showing in prices for new nodes, which are getting denser but not cheaper.
It's true we hit limits, but I feel like a lot of it was "limits" in the sense that the tradeoff stopped being worth the cost, so we optimized in other areas.
So we hit limits on clock speed in the early 2000s (e.g. the 4GHz wall), but it also turned out that, with mobile as the driver for sales, no one really cared much about clock speed compared to performance/watt.
Clock speed mattered, but only relative to how many watts it took to get it (and above 4GHz... too many watts).
But we've seen a 15x improvement over the last 20 years. Performance/Watt is WAY up.
My guess is that LLMs are going to drive another "improvement cycle" in areas that we didn't care much about before.
I've built about 10 personal desktop machines (1 every ~4 years) and I can honestly say that I didn't care much about memory bandwidth prior to 2021.
In the same way that I didn't care much about how many watts my pentium 4 was using in 2005.
But now... now I care a lot about memory bandwidth. I care about memory speeds and total system ram in a manner I really, really didn't before.
So I think we're going to see a big shift to machines built on unified ram with a crazy focus on squeezing memory bandwidth and total ram capacity as far as we can.
My bet is that we'll get a similar 10-15x improvement by 2040 in unified system ram designs.
I fully expect to see 2TB unified RAM desktops and 200GB unified RAM phones be relatively common on a 20 year timeline, assuming we see similar levels of geopolitical stability (e.g. World War 3 would throw a wrench into things).
Yeah, even Windows managed to not drive terribly dramatic upgrades in general computing (besides Windows’ absurd RAM usage and now requiring a TPM).
In the old days, Microsoft Entertainment Pack games were somewhat visibly taxing on some lower end systems.
Physical limitations of the manufacturing process may be a more significant factor, starting with TSMC's 10nm node ten years ago.
I’m kinda lost here… do y’all really have machines in your houses with hundreds of gigs of RAM?? Am I just behind the times?
The page advertises the 8-bit quant as taking ~800GB, which seems like it would require at least 4 consumer motherboards fully stacked w/ 4x64GB DIMMs each (256GB per board).
Maybe “locally” has slowly come to imply “…on your homelab”?
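A quick sanity check of that board count, as a rough sketch. The ~800GB figure and the 4x64GB layout come from the comment above; the function name is made up, and the sketch ignores KV cache, OS overhead, and the networking you would need to split a model across machines.

```python
import math

def boards_needed(model_gb, dimm_gb=64, slots_per_board=4):
    # Capacity of one fully populated board, e.g. 4 x 64GB = 256GB.
    per_board_gb = dimm_gb * slots_per_board
    # Round up: a partially used board is still a whole board.
    return math.ceil(model_gb / per_board_gb)

print(boards_needed(800))  # 4 boards at 256GB each
```

By this arithmetic three fully stacked boards top out at 768GB, just short of the advertised footprint.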
DRAM prices at mid-2025 rates were ~$2.5/GB for DDR5 and ~$1.5/GB for DDR4. "Hundreds of gigs" of RAM used to be under $500. 128GB of the cheapest RAM used to be like $200. For some reason it seemed to go over a lot of people's heads that the hypothetical future machines from CS/CE textbooks were attainable for that little; there seemed to be some fixation on the idea that 16GB is all you need.
You don't have to have a server, workstation motherboards support lots of memory channels.
I was lucky to buy a lot of RAM before prices skyrocketed. I knew I wanted to play with this stuff, so I spent what felt like a lot of money at the time to buy 8x96GB DDR5-6400 RDIMMs. Now the same RAM costs at least 6x more.
As soon as Llama came out I had a realization what was coming and went all-in on hardware with the assumption open source would catch up with GPT4. Surprise, it did, we now have small models that absolutely crush GPT4 in performance.
It wasn’t that absurdly expensive for a hobby, I bought 64GB DDR4 ECC sticks between $70-$100 on eBay before everything took off. Now everyone is in here debating if open source is 1 month or 3 months behind SOTA. The future is obviously local.
I got a 2U rackmount with 192GiB of DDR4 for $1.1k USD in 2023. Around 1.5 years ago, server RAM could be had pretty cheap, especially slower LRDIMMs (I wanna say 512GiB of DDR4 was <$500 USD). I checked a couple of old ServeTheHome threads and I'm seeing maybe around $50 per 32GB RDIMM, although I thought it was cheaper than that for a little while.
RAM wasn't expensive even a year ago. I maxed out a used Dell Precision T5610 with 128 GB DDR3 for $250 in 2021.
I have the RAM, but not the VRAM. What kind of speed/tps could you expect from a 3090 with 24GBs of RAM? I am somewhat tempted to pick a GPU with 24GBs of RAM.
A GPU with 24GBs of RAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config, and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it some poorly conceived spec with lurking design headaches, then it goes on an endless thinking binge and ultimately fails.
Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.
Once you go above "a used 3090 at a decentish price", then I strongly recommend renting cloud GPUs or at least testing models using paid APIs. This allows testing your use case before spending piles of money.
Generation is basically just memory bandwidth math.
Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB. With 100GB/s of memory bandwidth (substitute whatever yours is), you get about 5 tokens per second.
And with MTP (or other speculation techniques) you can ~double that.
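The same back-of-the-envelope estimate as a small sketch. It assumes decode is purely memory-bandwidth bound and that every active weight is read once per token; the 40B active parameter count is the guess from above, and `speculation_speedup` is a hypothetical knob standing in for MTP-style gains.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight,
                             bandwidth_gb_s, speculation_speedup=1.0):
    # GB of weights read per generated token: params (billions) * bytes per weight.
    gb_per_token = active_params_billion * bits_per_weight / 8
    # Bandwidth-bound upper limit, optionally scaled by a speculation factor.
    return bandwidth_gb_s / gb_per_token * speculation_speedup

print(decode_tokens_per_second(40, 4, 100))       # 5.0
print(decode_tokens_per_second(40, 4, 100, 2.0))  # 10.0
```

This is an upper bound: it ignores KV cache reads, PCIe transfers during MoE offloading, and compute limits, so real numbers will come in lower.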
MTP on a MoE is hit or miss. If you're bottlenecked on memory, MTP can increase the number of active experts (like any batch processing would), which can eat away gains from it.
Funny I casually asked Gemini and it said 500k for unquantized with decent throughput.
This is why you shouldn't uncritically believe an answer from an LLM (though you shouldn't do that for an answer from a human either).
But I did my research online and the sun cycle is every 11 years and something something global warming is a hoax every single year now.
That's fair for new hardware. You probably want to prompt "homelab" or "used hardware" to compare what's in this thread.
I asked Gemini and it replied with "Error: 400 Your prompt was blocked by safety filters. Please revise and try again."
Safety from competition!
I asked and it said “403 forbidden - careful peon attempts to bypass the late stage capitalism api with your monetary offerings in exchange for your daily tokens will get you perma banned right to jail”.
LLMs aren't discrete calculators or estimators of things unless framed and guided to do so.
Good job I didn't use a vanilla LLM without a tool-use harness then.