I’ve been wanting to get better acquainted with local inference but I don’t have the hardware, which has made me think about something I haven’t seen discussed, which is local collaboratives. The economics makes it seem like a group of people joining together to run good hardware and an open model might make sense, but I haven’t seen anything like this mentioned. Have I been missing it?
I think it would be pretty neat to launch a service helping people who wanted to participate in something like that locate one another.
The reason you don't see more of this is that everyone does the math, realizes it's not a good deal, and gives up on the idea.
There's a post at the top of /r/localllama about this exact math right now: https://www.reddit.com/r/LocalLLaMA/comments/1ubrcwj/tokenom...
TL;DR: Running GLM 5.2 is going to cost about $20K minimum, and that's going to be painfully slow compared to the cloud-hosted versions. Even in the estimates where the server is computing tokens 24/7, you can't break even for several years.
The only reason to run locally is if complete data privacy is your top concern. You pay a high premium for that.
If you invest the minimum to run the model, that's obviously more expensive per token than investing the optimum to get the best price/performance tradeoff (which for GLM 5.2 is at least five times that figure).
If you can bring enough load to run the model on close-to-optimal hardware 24/7 with multiple concurrent requests, and have reasonably cheap power and AC, you would break even in a reasonable timespan. That won't happen unless you are self-hosting for a medium-sized company. I guess you could sell your spare capacity to get better utilization ... and we've reinvented hosted inference.
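For anyone who wants to plug in their own numbers, here is a rough sketch of the break-even math being argued about. Only the $20K hardware cost and the 20 tokens/s figure come from this thread; the cloud price, power draw, and electricity rate are placeholder assumptions, and the function name `break_even_years` is made up for illustration. It also ignores depreciation, cooling, and maintenance.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24


def break_even_years(hardware_cost, tokens_per_sec, utilization,
                     cloud_price_per_mtok, power_kw, electricity_per_kwh):
    """Years until owning the box beats paying a hosted API for the same tokens.

    Returns float('inf') if power costs eat all of the savings.
    """
    # Tokens actually generated per year at the given utilization (0.0 to 1.0).
    tokens_per_year = tokens_per_sec * utilization * SECONDS_PER_YEAR
    # What the same tokens would cost from a hosted provider.
    cloud_cost_per_year = tokens_per_year / 1e6 * cloud_price_per_mtok
    # Simplifying assumption: the box only draws power while it is busy.
    power_cost_per_year = (power_kw * HOURS_PER_YEAR * utilization
                           * electricity_per_kwh)
    savings_per_year = cloud_cost_per_year - power_cost_per_year
    if savings_per_year <= 0:
        return float("inf")
    return hardware_cost / savings_per_year


if __name__ == "__main__":
    # Placeholder inputs: $3 per million tokens, 1 kW draw, $0.15 per kWh.
    for utilization in (0.05, 0.25, 1.0):
        years = break_even_years(20_000, tokens_per_sec=20,
                                 utilization=utilization,
                                 cloud_price_per_mtok=3.0,
                                 power_kw=1.0, electricity_per_kwh=0.15)
        print(f"utilization {utilization:.0%}: break-even in {years:.1f} years")
```

The point of the utilization parameter is that a shared box used a few hours a day saves proportionally less per year, so the payback period stretches accordingly.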
I mean sure, if you’re attempting to run the biggest possible models, it’s going to require a stupid amount of compute? I thought we all knew this?
The appeal to me is that we can run that, but we can also run smaller models on a laptop _and it’s functional!_ I can run DeepSeek v4 flash and a qwen 3.6 on my laptop! That’s crazy good.
Conversely, all the cloud LLMs are being subsidized by their investors in addition to massive economies of scale.
It is false to say that all cloud LLMs are subsidized. The open-weights models are hosted through numerous third-party providers on OpenRouter that operate as hosting businesses. They aren’t spending investor money to provide tokens for you at below-cost rates.
Economies of scale are enough to explain the entire price difference. Running 8 concurrent requests at 100 tokens/s on $100k hardware is a lot cheaper per token than running one concurrent request at 20 tokens/s on $20k hardware.
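To put rough numbers on that comparison, here is a small sketch that amortizes the hardware cost over the tokens each setup could generate. The hardware prices, request counts, and tokens/s are the ones quoted above; the 3-year amortization window and 100% utilization are assumptions, and power is left out entirely.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def hardware_cost_per_mtok(hardware_cost, concurrent_requests,
                           tokens_per_sec_each, amortization_years=3.0):
    """Amortized hardware cost in dollars per million tokens at 100% utilization."""
    # Aggregate throughput is what matters for cost, not per-request speed.
    aggregate_tps = concurrent_requests * tokens_per_sec_each
    lifetime_tokens = aggregate_tps * SECONDS_PER_YEAR * amortization_years
    return hardware_cost / lifetime_tokens * 1e6


small = hardware_cost_per_mtok(20_000, concurrent_requests=1,
                               tokens_per_sec_each=20)
large = hardware_cost_per_mtok(100_000, concurrent_requests=8,
                               tokens_per_sec_each=100)
print(f"$20k box:  ${small:.2f} per million tokens")
print(f"$100k box: ${large:.2f} per million tokens")
print(f"ratio: {small / large:.1f}x")
```

The ratio does not depend on the amortization window: 5x the spend buys 40x the aggregate throughput, so the smaller box pays about 8x more in hardware per token.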
https://news.ycombinator.com/item?id=48524387
There are plenty of providers of open models that offer very affordable rates. Generally, I recommend looking at OpenRouter since they track various metrics for the various providers.
Open models hosted in Cloud???
AWS Bedrock hosts Gemma 4 31B and this is The Best Deal – hands down. Try it. Vertex also has Gemma 4 MoE version. Not "lobotomised" by quants. There are also GLM (latest) and Qwen / DS (but these two are not latest versions)