Is there any indication of what compute resources this will actually require (in its various incarnations)? Does it incorporate any of the optimisations pioneered by Google (such as TurboQuant, MTP) or some other original innovations to make the frontier quality realistically available to local users?
The GLM-5 series is 744B-A40B. This is not a local model for any reasonable definition of local, but it's an open model which means (once they upload the weights in a week or so) there will be a dozen third-party inference providers competing on price per token.
> This is not a local model for any reasonable definition of local
That's true for now. I am hopeful that once the hardware markets have recovered from OpenAI's sabotage, we will see more hardware dedicated to local inference that can handle these big models.
Also, I'm thinking about the unique MoE routing that Apple is using with their new Apple Foundation Model. The model is trained and architected so that experts are not swapped for every token, but only occasionally. This suggests that e.g., a 744B parameter model in the future could have experts offloaded to SSD and still run with the effective computing requirements of a 40B model.
Reading weights out of memory is the definition of a large linear read. I'm a bit mystified someone hasn't put an embarrassingly parallel flash storage controller next to some tensor processors on a PCIe card. It could have 4 TB of flash hanging off enough channels to saturate SRAM, skipping DRAM entirely, and could even offload prompt processing to a GPU in the same workstation so long as it got reasonable tokens/s in inference. I'd buy one tomorrow.
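A rough back-of-envelope sketch of why flash bandwidth is the crux here, assuming decode is purely bandwidth-bound and each token streams its active weights once. The 40B active-parameter figure comes from the 744B-A40B spec upthread; the 7 GB/s single-drive figure, the function names, and the `tokens_per_swap` knob (a stand-in for swapping experts only occasionally) are illustrative assumptions, not measurements.

```python
def streamed_gb_per_token(active_params_b: float, bits_per_param: float,
                          tokens_per_swap: float = 1.0) -> float:
    """GB read from storage per generated token (1 GB = 1e9 bytes).

    active_params_b: active parameters per token, in billions.
    tokens_per_swap: hypothetical knob; if experts are only swapped every
    N tokens, the streaming cost is amortized over those N tokens.
    """
    return active_params_b * bits_per_param / 8 / tokens_per_swap


def decode_tokens_per_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Upper bound on tokens/s if weight streaming is the only bottleneck."""
    return bandwidth_gb_s / gb_per_token


def required_bandwidth_gb_s(target_tokens_per_s: float, gb_per_token: float) -> float:
    """Aggregate read bandwidth needed to hit a target decode rate."""
    return target_tokens_per_s * gb_per_token


# 40B active parameters at FP8, streamed in full for every token.
per_token = streamed_gb_per_token(40, 8)
print(decode_tokens_per_s(7.0, per_token))       # one assumed 7 GB/s NVMe drive
print(required_bandwidth_gb_s(10, per_token))    # bandwidth for 10 tokens/s

# Same model, but experts swapped only every 8 tokens (assumed).
amortized = streamed_gb_per_token(40, 8, tokens_per_swap=8)
print(decode_tokens_per_s(7.0, amortized))
```

Under these assumptions a single drive manages well under one token per second, and 10 tokens/s needs roughly 400 GB/s of aggregate read bandwidth unless swaps are rare, which is why many parallel flash channels matter.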
For the last year, there has been development work at several companies for products including HBF (high-bandwidth flash memory) as a supplement to HBM, in order to enable running inference for big LLMs at a reasonable cost, e.g. on one GPU-like card.
HBF was initially announced by SanDisk early in 2025. Early this year, Hynix announced that it had joined SanDisk in producing HBF and that the common specification will be standardized under the Open Compute Project.
With HBF, it would be easy to make a GPU card with 4 TB of HBF, which could run the biggest existing open weights LLMs in their native unquantized form.
Exciting news! This is how I see running frontier models at home becoming reasonably affordable. Though it may take a depreciation cycle or two.
For sparse MoE models, the individual expert layers that inference samples from are actually quite small: single-digit megabytes or so.
Is there reason to expect the consumer hardware markets to recover any time soon?
Is there reason to expect they’ll ever recover without an AI bust that takes down the U.S. economy?
I don't think it'll ever recover. Partially perhaps. But we have bigger problems to worry about really.
Normally, experts are picked for every layer not just every token. But there are plausible ways of getting around that bottleneck while streaming if you can batch many inferences together. Still, the Apple approach of swapping the experts only rarely is interesting, though it likely degrades the model a lot.
Just get the bigger models to figure out the architecture required for hot-swappable sub-experts without loss of performance!
Got all those tokens, isn’t that the point of auto research and friends??
(Only sort of joking).
As far as I can tell, this type of model requires 640GB+ of memory using FP8, so it can likely be run in 320GB+ of memory using FP4 or similar. That would be 3 Nvidia DGX Sparks, or $12k of hardware. Is that correct? If so, it could make perfect sense for a small business.
The performance would be abysmal spread across four Sparks, I'd think, though I guess MoE mitigates that somewhat. Still better to just pay for it in the cloud. (Though I've spent about $4k on local compute for AI experimentation, I don't think it pays for itself, I just like tinkering.)
You probably need four of them in practice.
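A quick sketch of the arithmetic behind the three-versus-four Spark estimates, taking the 744B total-parameter figure quoted upthread at face value and counting weights only. The 128 GB per-device memory and the 20% overhead allowance for KV cache and activations are assumptions for illustration, and the helper names are made up here.

```python
import math


def weight_footprint_gb(total_params_b: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    total_params_b is the parameter count in billions.
    """
    return total_params_b * bits_per_param / 8


def devices_needed(footprint_gb: float, device_mem_gb: float = 128.0,
                   overhead: float = 0.0) -> int:
    """Smallest device count whose pooled memory holds the footprint.

    overhead is an assumed fractional allowance for KV cache, activations
    and runtime buffers; 0.0 means weights only.
    """
    return math.ceil(footprint_gb * (1 + overhead) / device_mem_gb)


fp8_gb = weight_footprint_gb(744, 8)
fp4_gb = weight_footprint_gb(744, 4)

print(fp8_gb, devices_needed(fp8_gb))          # FP8, weights only
print(fp4_gb, devices_needed(fp4_gb))          # FP4, weights only
print(devices_needed(fp4_gb, overhead=0.2))    # FP4 with assumed 20% headroom
```

With weights alone, FP4 fits in three 128 GB devices, but any realistic headroom pushes it to four, which matches the replies above. This says nothing about throughput across the interconnect.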
If you have 80k in hardware you can run it. There is no such thing as an effective local model that runs on consumer hardware; anybody telling you otherwise is lying or delusional. JuSt a FeW MoRe ReLeAsEs
> effective
Depends on the task.