I believe they mean distribution (inference). The Chinese model is currently B.Y.O. GPU; the American model is GPUaaS.

Why is inference less attainable when it technically requires less GPU processing to run? Kimi has a chat app on their page using K2, so they must have figured out inference to some extent.

That entirely depends on the number of users.

Inference is usually less GPU-compute heavy, but much more VRAM-heavy pound for pound compared to training. The general rule of thumb is that you need roughly 20x more VRAM to train a model with X params than to run inference on that same model. So assuming each serving copy of the model handles a batch of b users, serving more than 20*b users would tilt total VRAM use toward inference.
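Back-of-envelope on where a multiple like that comes from, assuming mixed-precision Adam training and fp16 weights for inference; the exact factor depends heavily on activation memory and KV cache, which this sketch ignores:

```python
# Rough per-parameter VRAM accounting; every constant here is an assumption
# (mixed-precision Adam, fp16 inference weights), not a measured number.

def training_vram_gb(n_params: float) -> float:
    # fp16 weights (2 B) + fp16 grads (2 B) + fp32 master weights (4 B)
    # + Adam moments (8 B) = ~16 bytes/param, before activation memory.
    return n_params * 16 / 1e9

def inference_vram_gb(n_params: float) -> float:
    # fp16 weights only: ~2 bytes/param, before KV cache.
    return n_params * 2 / 1e9

n = 70e9  # hypothetical 70B-parameter dense model
print(f"training : ~{training_vram_gb(n):,.0f} GB")   # ~1,120 GB
print(f"inference: ~{inference_vram_gb(n):,.0f} GB")   # ~140 GB per model copy
```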

This isn't really accurate; it's an extremely rough rule of thumb that ignores a lot (activation memory, KV cache, quantization). But it's important to point out that inference is quickly becoming a major cost for all AI companies. DeepSeek claims they spent $5.6M to train DeepSeek V3; at their current pricing that's about 10-20 trillion tokens, or 1 million users each sending just 100 requests at full context size.
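Rough numbers behind that comparison, assuming a blended price of about $0.28 per million tokens and a 128K-token context; both figures are round assumptions for illustration, not DeepSeek's exact rates:

```python
# Back-of-envelope for the training-cost-vs-serving comparison above.
# Price and context size are assumed round numbers, not exact DeepSeek pricing.

train_cost_usd = 5.6e6        # claimed training cost
usd_per_m_tokens = 0.28       # assumed blended API price per 1M tokens
context_tokens = 128_000      # assumed full-context request size

tokens = train_cost_usd / usd_per_m_tokens * 1e6
requests = tokens / context_tokens
print(f"~{tokens/1e12:.0f} trillion tokens")                    # ~20T
print(f"~{requests/1e6:.0f}M full-context requests")            # ~156M
print(f"~{requests/100/1e6:.1f}M users at 100 requests each")   # ~1.6M
```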

> it technically requires less GPU processing to run

Not when you have to scale. There's a reason every LLM SaaS aggressively rate-limits and, even then, still has regular outages.
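To put rough numbers on "when you have to scale" (the per-user decode speed and per-GPU throughput below are assumptions picked for round numbers, not benchmarks of any particular deployment):

```python
# Illustrative only: how serving throughput requirements stack up at scale.

concurrent_users = 1_000_000       # users decoding at the same time
tokens_per_sec_per_user = 20       # assumed acceptable decode speed
gpu_tokens_per_sec = 2_000         # assumed batched decode throughput per GPU

gpus = concurrent_users * tokens_per_sec_per_user / gpu_tokens_per_sec
print(f"~{gpus:,.0f} GPUs just to keep up with decode")  # ~10,000
```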

tl;dr: the person you originally responded to is wrong.

That's super wrong. A lot of why people flipped out about DeepSeek V3 is how cheap and how fast their GPUaaS model is.

There is so much misinformation, both on HN and in this very thread, about LLMs, GPUs, and cloud, and it's exhausting trying to call it all out, especially when it's coming from folks who are considered "respected" in the field.