Apparently GLM 5.2 is 753B parameters [1], what kind of hardware are people using to run this locally?

[1] https://huggingface.co/zai-org/GLM-5.2

I ran it on my laptop, which is a Lenovo Legion 5i (think 32 GB RAM, 4060 w/ 8 GB VRAM, you get the picture). It was a quantized model (otherwise it would not fit on my NVMe 1TB drive) at 4 bits per weight - UD_Q4_K_XL. It ran at about 12 seconds per token (not tokens per second). A fun project, but not worth it. I used 4096 tokens of context cache, and I ran it with llama.cpp - as it supports memory mapping. Because the whole thing could obviously not fit in RAM, I was curious how much it would need to stream from SSD. The answer? For a simple 4 sentence description of who it was, about 1.5 TiB was streamed from disk.

Thank you for sharing. 1.5TB of streamed data at 12 seconds per token on a high end consumer laptop is a pretty high requirement - I can only imagine how much that cost to train. I don't know how running this model could be cost effective for anybody.

Indeed - definitely not cost effective to run it on this laptop LOL. It makes me wonder how fast we could run the model if we could fit the weights entirely within CPU cache (assuming a whole ton of CPUs with low latency & high speed IO of course).

short answer: they mostly aren't

A few people are running highly quantized models with limited context windows. It's still impressive, but it's not benchmark-level intelligence. Very few people could afford a rig that gives reasonable local performance at a reasonable quant and full context size.

The antirez example is a 2.6-bit quant, 32k context, and a few tokens per second... on a ~$7000 MacBook M5 (new RAM pricing).

Run quantized versions. https://unsloth.ai/docs/models/glm-5.2

follow antirez - https://x.com/antirez/status/2071173841175363905?s=20

That's quantized

It's a nice technical achievement but looks unusably slow for actual work

8 X RTX6000. It will run you around 80-100k to get started with a model at this size with decent tps..

Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

For $100k you could run this model 24/7 through OpenRouter with 10 concurrent sessions at 50tps for a decade and have money left over for a vacation. There's no point in investing this type of money in local models unless you have a business where you're already paying for many employees' individual token usage.

> 8 X RTX6000. It will run you around 80-100k to get started

8 x RTX6000 GPUs cost $100,000 alone. You then need to build a system that can support those GPUs with enough PCIe lanes through a PCIe switch.

It's going to be $120K to $150K to build or buy a system to run this.

Not to mention the three separate dedicated 15A circuits you would need to have installed in order to run the 3x 2000W power supplies running ideally at no more than 1400W sustained load each. And definitely would need 200A service to the house if you have a family living there with you.
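For anyone wondering where the ~1400W per supply comes from, here's a rough sketch. It assumes standard US 120 V / 15 A branch circuits and the usual convention of keeping continuous loads at or below 80% of the breaker rating; both are my assumptions, not something stated above, and the function name is just for illustration.

```python
def continuous_watts(breaker_amps: float, volts: float, derate: float = 0.8) -> float:
    """Sustained power a circuit can supply when continuous loads are
    capped at `derate` of the breaker rating (0.8 is an assumed convention)."""
    return breaker_amps * volts * derate


# Assumed: one 15 A, 120 V circuit per power supply, three supplies total.
per_circuit = continuous_watts(15, 120)
total = 3 * per_circuit

print(f"per circuit: {per_circuit:.0f} W sustained")
print(f"three circuits: {total:.0f} W sustained")
```

So each 2000W supply is effectively limited by the circuit it is plugged into, not by its own rating.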

But hey you could save on heating?

That’s a uniquely US issue - in NZ you can get a 100A single phase at 230V nominal without any issue. 23kw, straight to your door.

A single circuit using 10mm TPS would technically be enough to run what you’re describing. Might be pricey though, I’d probably take the excuse to get 3 phase installed so I could get access to the stock of used 3 phase machinery.

> That’s a uniquely US issue - in NZ you can get a 100A single phase at 230V nominal without any issue. 23kw, straight to your door

In the US it's common to get 200A 120/240V split-phase service. We're talking about the wiring inside the house, though.

How do you think everyone here is charging their electric cars at home and running our AC and electric cooktops at the same time if we didn't also have that? :)

You need to derate for constant loads here, and I assume you have to do that in NZ as well.

So, no, not a "uniquely US issue".

Not so sure about that. 200amp @ 240v is pretty standard for modern houses in the US. My house in Japan was only 40amps, so there are plenty of countries where this would be an issue.

isn't throwing that into a [insert financial vehicle that gives 99.99999% safe returns] going to destroy that when you factor in electricity costs?

Or even just electricity costs vs token cost

You can run the NVFP4 quant with 8x RTX6000 cards at 50-75 tps output, but not (practically speaking) the OEM FP8 version. You will learn more about PCIe than you ever wanted to know.
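A quick back-of-envelope on why FP8 is impractical on 8 cards while a 4-bit quant is fine. Assumptions that are mine and not from the thread: 96 GB of VRAM per card, and roughly 4.5 effective bits per weight for a 4-bit format once per-block scales are counted. The 753B figure is the one quoted at the top.

```python
PARAMS = 753e9          # parameter count quoted at the top of the thread
GPU_COUNT = 8
VRAM_PER_GPU_GB = 96    # assumption: 96 GB per card


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


total_vram = GPU_COUNT * VRAM_PER_GPU_GB

# 4.5 bits/weight is an assumed effective rate for a 4-bit quant with scales.
for name, bits in [("FP8", 8.0), ("4-bit quant", 4.5)]:
    need = weights_gb(PARAMS, bits)
    spare = total_vram - need
    print(f"{name}: {need:.0f} GB of weights, {spare:.0f} GB left for KV cache and activations")
```

Under these assumptions the FP8 weights technically fit but leave almost nothing for context, which matches the "not practically speaking" above.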

The real gangstas are running 16x RTX6000s. Too rich for my blood, and the NVFP4 quant doesn't seem to be that much worse.

Anyone done any benchmarks on the NVFP4 quant? Seriously considering pitching an 8 x RTX 6000 Pro box at work to run GLM-5.2 in an air gapped environment.

At that price point you could also go with a Tenstorrent Galaxy Blackhole, which starts at $110,000.

Ooh, I hadn't seen these yet! That looks quite compelling, my only hesitancy would be what the software support looks like. But 1 TB of memory for $110k is really intriguing - I might go bother a sales rep. Thanks!

Good luck. I’m in the legal field, and even there, selling airgapped is tough.

What are the challenges you've seen in selling air gapped? Is it the high upfront cost? Challenges with hardware maintenance or something else?

We already use AWS. Everyone else is using AWS, so if there's an issue we can just say we were following industry standards.

My issue is we likely can't use AWS (non-US, CLOUD Act concerns + export control concerns).

Depends how much you value privacy and running uncensored models.

Personally, I’m waiting for hardware to hit the secondary market before I buy something to run unquantized models like GLM. But I have no doubt that I will, at some point.

>Don't worry though, open source evangelists will tell you that these will be running on your phone in the next 3 years.

Not sure if you're being sarcastic, but I can run a quantised version of Gemma or Qwen on my 16GB M1 Macbook Pro that beats GPT-4 from 2023 hands-down.

I wouldn't be surprised if, in another 3 years, you'd be able to run something as powerful as Opus 4.5 or GLM-5.2 on standard consumer hardware - say a 32GB/64GB M7 Pro.

I also wouldn't be surprised if, 3 years after that, cheaper hardware and improved model efficiency means that there's a much smaller gap between what you can run on a consumer CPU (which, with memory prices coming down, could look like a 256GB M9 or M10 Pro) and $100k GPU cluster.

This is clearly where the industry is going, imho. Everyone who is playing with LLMs wants a laptop with enough grunt to run a decent model locally.

We've been sat with basically the same PC specs for ~20 years - our current specs are within an order of magnitude of the ones we could buy back in 2010. This is not really constrained by tech, as we could have much, much, larger machines. It's more because there's no mass demand for much, much, larger machines - if it's big enough to run Office apps or VSCode then you're good to go. The exponential growth we saw in the 90's was driven as much by software demand as it was by hardware development.

I can see the next 10 years produce the same kind of push for larger machines that the 90's did. And we should probably expect the same kind of standards churn as our existing technologies for storage, memory, etc, don't scale up enough and new technologies become worth developing because there's demand for them.

It seems relevant for playing with LLMs, but for actual work this seems far off for me.

My productivity profits from the best intelligence available, a decent context size, and a batch size of four.

While my MacBook has 48 GB of RAM, not only do I want the above requirements at a decent speed, but I also need my machine to run the development tools and test suites, ideally without the fans blasting at full load.

For the foreseeable future I will stay with providers rather than local inference, apart from niche use cases.

Yeah, agree, but that's the point, really. If I could buy a 16TB machine with 4 TPUs for ~$5K and run a frontier model locally, I would.

I'm in Australia, so we're probably not getting access to Fable again. We're learning that a faster model + better harness/framework > smarter model. So being able to run GLM5.2 locally and super-fast would be great.

My only concern is whether the same specs would cost 10x more than today, given the trajectory of memory prices lately.

I think this is where the new technology comes in. There is demand for 10x (or 1000x) the memory that we're using at the moment, so someone/something will satisfy that demand. We haven't had that demand up until now, because 16GB was a perfectly reasonable amount of memory that could run pretty much anything, and if that wasn't enough then 32GB was. There was zero demand for 16TB memory machines because no-one had any application for that much memory. Now that's changing, and there is demand for that much, so we'd expect to see that being made available.

But the existing tech we're using for 16GB probably isn't going to scale to 16TB at a reasonable price point. And the price point is relatively inelastic - people are used to paying <$5K for their computers, and they're not going to go much above that. You'll get early adopters paying $10K or more for a machine that large, but not the early majority. And even then, obviously, $10K is not going to buy you a 16TB memory machine.

So there's room for a new technology to come in, where there wasn't previously. This is what happened all through the 90's, and we churned through a bunch of standards and technologies to try and keep up with demand.

> memory prices coming down

Are they?

I suspect AI labs are buying stuff not just for their own use, but to make local use too expensive to be an option :-( And they can always make the "best" frontier model even bigger (though only fractionally better) so it's always out of reach of local use, while consumer laptops have nearly the same amount of memory they had a decade ago.

    m                  o
    o
    d
    e
    l             o
    s
    i        o
    z    o
    e  2020 2022 2024 2026
    
    
    c                  
    h
    e
    a
    p             o      
    R        o     
    A    o                
    M                   o
       2020 2022 2024 2026

For most tasks, I don't value the LLMs based on their absolute capabilities. I wouldn't want to use GPT-4 today even if it's free.

I'm being very sarcastic. Local model evangelists seem to just be operating on vibes when they say these things, and are completely disconnected from how models work and what the hardware requirements are.

Prices aren't going down, and consumer platforms are being shipped with less RAM so we can be sold cloud products. This isn't going to happen.

Can you please explain to me how you're going to fit 700B-1T params in 64GB of RAM? You realize there are memory requirements proportional to model size?
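To put numbers on "proportional to model size", here's a minimal sketch that counts weights only (no KV cache, no OS overhead). The helper name and the bit widths are illustrative assumptions, not any particular runtime's accounting.

```python
def max_params_billion(ram_gb: float, bits_per_weight: float, reserved_gb: float = 0.0) -> float:
    """Largest parameter count (in billions) whose weights fit in RAM.

    Counts weights only; `reserved_gb` is a hypothetical allowance for
    the OS, KV cache, and everything else.
    """
    usable_bytes = (ram_gb - reserved_gb) * 1e9
    return usable_bytes * 8 / bits_per_weight / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{max_params_billion(64, bits):.0f}B params fit in 64 GB")

# Going the other way: weights for 700B params at 4 bits per weight.
print(f"700B at 4-bit: ~{700e9 * 4 / 8 / 1e9:.0f} GB")
```

Even at 4 bits with zero overhead, 64 GB tops out at a fraction of a 700B model.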

> Can you please explain to me how you're going to fit 700B-1T params in 64GB of RAM?

You don't. What they're saying is that today's small models (that fit on consumer hw) are better than yesteryear's top models. GPT4 was reportedly 8x 220B (~1.6T) MoE, and today you can run a 30-120B model that beats it handily in real-world tasks.

Similarly for 4-20B models beating GPT3 (175B) and so on.

There is a sweet spot of "good enough" that the small models can reach, where you get equivalent tasks solved fully locally. They'll never touch SotA, but they'll reach the SotA of 2-4 years earlier. Depending on the task, that can be "good enough".

Given GLM is open weight - all you need is one company to take the Taalas approach (model on hardware), and you're sorted, right?

https://taalas.com/products/

Yeah I completely agree. But this is a much larger model than the 8B one they put on a chip, so that's probably an engineering challenge for now. Also, how expensive would it be?

No idea - AI tells me under 30 dollars per unit for the ROM with development costs in the low 10's of millions.

If that's anywhere near right then it seems like a no brainer.

Would you be better off pooling that money with some hackerspace group and then setting up shared inference infra, so that way you at least get better utilization?

And before you know it, you invented some openrouter provider from first principles...

Right. For example you will need to figure out how to share it and who maintains it.

You can then rent spare capacity out to people on a subscription or token basis ….wait

How do the economics of your statement work out? Clearly inference providers don't have a time to ROI of 10 years on their hardware costs; and that's without even taking ongoing energy costs into account. What's missing here?

Output tokens are actually kinda expensive for the provider.

The input cache hit tokens are incredibly cheap for them (incredibly high margin too, except for DeepSeek).

And input tokens are in the middle. Input tokens can be processed very efficiently.

Also his math is wrong. $100k gets you 22.7B output tokens at $4.4/M which is how much GLM 5.2 costs.

At 500 tokens/s, 22.7B tokens is only about 526 days, or roughly 1.44 years, which is much less than the life of the hardware.
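If anyone wants to check the numbers, here's a small sketch using only the figures from this subthread: $100k budget, $4.4 per million output tokens, and 10 sessions at 50 tps. The function names are made up for illustration, and it ignores input tokens and caching entirely.

```python
SECONDS_PER_DAY = 86_400


def tokens_for_budget(budget_usd: float, usd_per_million: float) -> float:
    """Output tokens a budget buys at a flat per-million price."""
    return budget_usd / usd_per_million * 1_000_000


def days_to_generate(tokens: float, tokens_per_second: float) -> float:
    """Wall-clock days to emit `tokens` at a constant aggregate rate."""
    return tokens / tokens_per_second / SECONDS_PER_DAY


tokens = tokens_for_budget(100_000, 4.4)
days = days_to_generate(tokens, 10 * 50)  # 10 sessions x 50 tps

print(f"{tokens / 1e9:.1f}B tokens, {days:.0f} days, {days / 365:.2f} years")
```

That comes out to well under two years of continuous output, nowhere near a decade.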

The inference providers are running batch sizes much larger than 10

Inference providers have been getting a firehose of investor cash to keep the chips running (and are looking around very nervously as that firehose starts to sputter).

you can however, have fun with it.

oil workers buy 100k trucks they don't do much with. why not 100k in computers?

Yea, as far as hobbies go, I feel like this is on the low end. I know people who collect watches and corvettes; that's way more expensive, and functionally you can't really do anything special with them.

The difference is watches and corvettes typically appreciate in value, whereas computer hardware typically drops like a rock.

> watches

Some, and the market fluctuates a ton.

> corvettes

Only the oldest, most unique model years: nobody is buying (C4-C5-realistically C6) mid-90s or early 2000s Corvettes for more than what they paid for them, and they never will.

Corvettes don't appreciate in value, and high end data center hardware isn't dropping in value anymore. A100s are more than 2 dollars an hour, more than they cost in 2023.

Also LLMs are mainly used for work, and if you can spend 6 digits on watches you're likely financially independent.

> The difference is watches and corvettes typically appreciate in value

Both of those things' value drops like a rock as soon as you buy them and, at least for cars, they don't all appreciate. Most don't. Even so, they appreciate at an incredibly slow rate.

I can't speak for watches but I'd be surprised if it wasn't the same situation.

At least the gpus can create value after you buy them before they are worthless.

hmm ok let's build a state of the art from 2021 homelab using 2x Epyc Milan chips + DDR4 RAM and lmk how much it costs...

I can't help but ask where this comment came from, you must have some exposure..

It is so easy to spend $100K on a pickup truck these days, it's not even funny.

A Honda minivan is > 50k.

Factory F350 Platinum is at least 90k sticker.

Yet Ford claims it is impossible to sell any pickups for > $60K, so they killed the Lightning.

I assume (since they claim they are selling the batteries to AI data centers), they’ll produce some sort of EV >= F150 once the bubble pops, and we get a new president.

Automotive EE here… every other decision about vehicles is about emissions. CAFE, the reason that a company releases X model is that they can then sell more Y models that get worse mileage.

EV is a separate thing. Vastly overmarketed for the technology as it exists today.

Because car loans can’t be used to buy computers

Surprising that the banking industry has not come up yet with the AI native consumer product loan for GPUs.

Probably a bit niche at the moment really. The only people interested in that are us nerds, and the product segment is very ad hoc - especially for the local crowd, where an Epyc with a bunch of PCIe risers and some 3090s on a steel frame is considered optimal.

Paging Mr. Son. Mr. Son, please pick up line 3.

And there's your idea. If you could find a way to get people to add another $500/month over 80+ months to an auto loan, dealers would eat that up like filet mignon.

Sure, if you want to light money on fire for entertainment, more power to you. There are probably worse ways to light 100k on fire. If I have an extra 100k lying around it's going to my family though.

As an individual I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag. Once they start trimming down the excess and making them field focused they will run just fine on people's individual devices.

> I do not need the whole model. I don't need the model to have knowledge of the rain history of Algeria nor how many colors are in the Russian flag

Isn’t the performance gap between quantized and full models indicative that even if you aren’t using it directly, the model knowing the colors in the Russian flag does have something to do with the intelligence you demand?

Do quantized models specifically prune out specific knowledge? I think they just compress things down but they're still in there. You'd most likely need to do that when you're doing the initial model training, but I'm no expert.

> they just compress things down but they're still in there

The compression is almost certainly in part specific knowledge getting fuzzed.

Yeah, but it's everything getting fuzzed, including the parts you care about.

Sure. There is a legitimate question around whether one can selectively excise “useless” knowledge. My guess is you can’t. The act of learning it encodes both the act of learning and the knowledge per se. The former is the power of the LLM. (I personally force mine to double check everything instead of going off memory.)

Quantizing is one thing. But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

Likewise, LLMs do not violate the laws of information theory, and therefore the only way to encode X amount of information in Y amount of bits where X > Y is by performing what is effectively lossy compression, and as X grows larger relative to Y the compression ratio must change to lose ever more information.

Yes, for the sake of making chatbots that are "conversational" in that they can interpret natural language as input and produce code as output you can easily benefit in incidental and unintuitive ways by training it on more natural language text. But for a given fixed parameter size, it's possible to produce a better model for a specific task by selectively not muddying its training set in the first place with things that are likely irrelevant to the task.

>But in general it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability, otherwise you'd have AGI just from reinforcing your model on memorizing the first 10^50 digits of pi.

It's hardly self-evident, and your counter-example is hardly applicable.

The first 10^50 digits of pi are not the same as having BREADTH of information in the training data, which is the whole point, not just any random "information that is irrelevant to your use case".

not to mention that the first 10^50 digits of pi compress to quite a small formula, so there's not much information there to begin with from a Shannon/Kolmogorov perspective
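To make the "small formula" point concrete, here's a sketch of a digit generator for pi. This is Gibbons' unbounded spigot algorithm as I remember it, picked by me as an illustration rather than anything cited in the thread. The point is just that a dozen lines can emit as many digits as you have time and memory for.

```python
from itertools import islice


def pi_digits():
    """Yield decimal digits of pi one at a time (Gibbons' spigot, integer-only)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: fold in another term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


print("".join(str(d) for d in islice(pi_digits(), 20)))
```

The program's length stays fixed no matter how many digits you ask for, which is the sense in which the digits carry little information.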

It is self-evident. Bringing up Kolmogorov complexity is irrelevant, we're talking about rote memorization, but if you can't ignore the given example then replace "digits of pi" with "bits of output from a true random number generator". There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.

"rote memorization" is not the right way to describe how an LLM works.

The memorization of say 100000 world facts through training texts, which enrich model associations all around, is absolutely not the same as rote memorization on 10^50 digits of pi. Not for a human, and even more so, not for an LLM.

An LLM trained with digits of pi and one trained with books and posts, even if they both have the exact same amount of bytes of training input, would not be comparable in any way in utility and reasoning capabilities.

>There's an infinite amount of information that we could shove into a model, and a finite amount of bits with which to store any of that information such that it can be usefully recalled or form useful logical associations.

Which is irrelevant. Anyway, the amount of information that doesn't form useful logical associations is even larger (e.g. actual human books vs possible permutations of characters and spaces). Just like those (random) possible permutations of characters aren't good for LLM input to get logical associations out of it, pi isn't either (logical associations of the kind we care for and expect, not of the kind related to pi's sequences).

Also it's not only not self-evident, it's also apparently wrong.

> actual human books vs possible permutations of characters and spaces

You're making the assumption that anything produced by a human necessarily contains more useful information than random noise does. This is false. Even when only considering human intelligence, it's entirely possible to absorb information that makes you stupider, not smarter; learning is only valuable if you actually learn the right things.

>Even when only considering human intelligence, it's entirely possible to absorb information that makes you stupider, not smarter

I'd say this exchange is a fine example of that :)

> it's self-evident that training the model on information that is irrelevant to your use case does not necessarily improve ability

We don’t understand AI or natural intelligence well enough to make such statements. As for self evidence, cross-domain competence in humans and the rise of generalist models over domain-specific ones (on competence, not cost) seem to pretty directly tank your hypothesis.

> We don’t understand AI or natural intelligence well enough to make such statements.

If you believe this then you don't understand AI or natural intelligence well enough to refute my statements either.

Perhaps you're trying to refer to something specific by "cross-domain" competence, but firstly, humans vastly overestimate the extent to which experts in one domain can be trusted to speak accurately on topics in other domains (this is a form of authority bias), and secondly, real cross-domain expertise is a result of pre-existing metacognitive ability such as keen reasoning ability, intense focus, and learning-how-to-learn. In other words, Leonardo da Vinci was not a genius because he was a polymath; he was a polymath because he was a genius.

Likewise, I see no evidence that "generalist models" have proven anything about their ability over domain-specific ones other than that the big AI firms seem to believe that "generalist models" are their golden ticket to AGI and therefore a quintillion-dollar valuation. It's obvious in the long run that tools built for specialized tasks will outperform generalist tools for specific tasks, in the same way that a multi-axis CNC mill does not outperform your bog-standard lathe for shaping objects with rotational symmetry, or perhaps more pertinently to this conversation, how no LLM will ever outperform Stockfish at chess.

Apparently irrelevant data can help because model weights are entangled.

Yeah, the neoclouds and hyperscalers are taking massive losses right now; self hosting is basically signing yourself up to do the same. There are philosophical reasons to do so but it’s a terrible economic decision.

Or you have data covered by HIPAA or GDPR, or PII, or you have to worry about others training on your data.

That too.

> 50tps for a decade

assuming demand doesn't keep on increasing. even google has trouble having enough capacity apparently.
