The cost of local hardware is amortized if a whole team uses it instead of just 1 dev (GPUs are extremely underutilized if you launch just 1 generation stream). I'm not sure why everyone always assumes solo devs with Macs. We've just ordered a large datacenter-grade node for use by the whole dev team, and the calculations show that it's going to cost the same amount of money if we kept using AWS Bedrock (infosec reasons) for a couple years but... it gives us 100% privacy, we're immune to all the AI regulation dramas in the US/EU, all the random outages, and the developers won't have to think about token limits/weekly caps etc. ever again. And all that with a model which is Opus-grade

(it's not our first AI server, we already have experience deploying LLMs for our clients, so the numbers look solid)

Yes but unfortunately a lot of the discussion that people participate in, are not done from a corporate point of view, but from a normal consumer level.

And there is a lot of drama in those discussions. GLM 5.2 is a great model for corporations to run, but people only want to hear about running a 35B/27B or maybe a 120B model. And in that market, subscription services are simply way better value for money (take in account the privacy issues).

Everybody wants GPT 5.5/Opus 4.8 Max levels, on a budget that simply is not realistic. And GLM fit in that 4.8 medium/low level.

But then people do not want to be told that running a 750b model in Q2 or Q1 is just going to destroy the models accuracy. And that is still going to cost them 5k+ for that reduced model.

The whole local llm landscape from a consumer point of view, is just filled with odd people. lol.

Corporation really benefit from those models, because spending $90k on a server, is a deductible expense. And they are billed at token prices anyway from all the major providers. So its a even faster ROI on that hardware.

I am surprised that nobody figured out to make a business of selling leftover capacity from corporate llm installations, because there is easily 12h+ just wasted (unless its a large corp that has people in all timezones).

> GPUs are extremely underutilized if you launch just 1 generation stream

why is that? b/c the thing is waiting for the hoooman and idling? or some parallelizable interleaving steps?

I have no intuition yet how this works under the hood.

Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. Not sure of the exact mechanism but e.g. llama.cpp's llama-server can do this (you tell it the number of slots to have when starting, then fire HTTP requests at it and it batches them together when it can).

Waiting for the hooman (or tool calls) won't help either, of course.

The mechanism is that generating tokens (the "decode" phase) in an LLM is limited by memory bandwidth for the weights, so computing multiple streams amortizes the bandwidth over streams as long as you can keep the contexts in RAM. This is most true for dense models and the always-on expert in MoE models, or when you have significantly more streams than experts for MoE models.

In contrast, prompt prefill is more easily compute-bound, so there are interesting trade-offs for latency of decode vs prefill when the LLM utilization is high.