The little Qwen 3.6 is at Sonnet level. Kimi 2.6 is about Opus. The one can run on a single GPU in your gaming PC. The other you can run way cheaper from a provider, or if you are really wealthy and have lots of GPUs you can run it yourself.
Not sure where DeepSeek 4 sits.
Kimi 2.6 is nowhere near even Sonnet in overall robustness. It can get close when everything goes perfectly.
I have about 1 KLOC of harness code, written by Kimi, to work around quirks in Kimi that no other model I've tested has needed, such as infinite tool-call loops and other weirdness (a rough sketch of one such guard is at the end of this comment).
You can do quite a bit with it and never run into those quirks, or you might hit them on every request.
It is very sensitive to "confusing" things about its environment in a way Sonnet and Opus are not.
Still great value, but they have some way to go.
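For flavor, here's roughly the shape of one of those guards. This isn't the actual harness code; the names and threshold are made up for illustration:

    from collections import Counter

    MAX_REPEATS = 3   # illustrative threshold: the same call recurring this often is suspicious

    def is_looping(tool_calls, window=6):
        """tool_calls: list of (tool_name, serialized_args) tuples, oldest first.
        Returns True if an identical call keeps recurring in the recent window."""
        recent = Counter(tool_calls[-window:])
        return any(count >= MAX_REPEATS for count in recent.values())

The agent loop checks something like this after each model turn and, if it fires, injects a corrective message or aborts instead of executing the tool yet again.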
Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge but don't you need to keep the whole model and context in a single GPU's VRAM? My understanding is that multiple GPUs help with scaling (they can handle N× as many inference requests simultaneously) but that they don't help with using large models. If multiple GPUs did help there, I could just jam another GPU in my box and double the size of model I can serve.
> Would "lots of gpus" even help for huge models? Maybe this is exposing my lack of knowledge but don't you need to keep the whole model and context in a single GPU's VRAM?
How do you think the large providers do inference? No single GPU has 1 TB+ of memory on board. It’s a cluster of a bunch of GPUs.
1T-parameter model instances (Opus, GPT, etc.) are not running on a single GPU. The catch is how the cards communicate and how the model is broken up. There's a bit that goes into it, but the answer is yes: the more GPUs, the bigger the model you can run.
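If it helps intuition, here's a minimal PyTorch sketch of the simplest way a model gets broken up (a toy stack of layers, not a real LLM, and it assumes you have two GPUs visible): the first half of the layers lives on one GPU, the second half on another, and only the activation crosses the link between them.

    import torch
    import torch.nn as nn

    # toy "model": first half of the layers on GPU 0, second half on GPU 1
    half_a = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).to("cuda:0")
    half_b = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)]).to("cuda:1")

    x = torch.randn(1, 4096, device="cuda:0")
    h = half_a(x)        # computed on GPU 0
    h = h.to("cuda:1")   # only the activation crosses the PCIe/NVLink interconnect
    y = half_b(h)        # computed on GPU 1

Real serving stacks also shard each individual layer across GPUs (tensor parallelism), which is where interconnect speed really starts to matter.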
Really cool. I'm very much still learning about this stuff. Sounds like this inter-GPU communication is a feature of special hardware (not consumer GPUs).
Ever hear of SLI (since superseded on high-end Nvidia cards by NVLink)? It's a GPU interconnect that's been available for a good long while now on high-end Nvidia GPUs. I believe AMD's implementation is called CrossFire.
GPU interconnect speeds are a big bottleneck today for GPUs in AI applications. Data can't move between them fast enough.
Not really; there are various ways it can be done, and I think even the old 1080 Tis could do it. Keep reading about it; my interest is in small models on a single GPU though, so I don't fuss over those details.
Most consumer cards had faster interconnects included on them until a generation or so ago, when Nvidia decided they wanted to differentiate their data center hardware more and removed the links that had been on the cards in various forms for 20-plus years.
Please don't oversell them. E.g. Kimi K2.6 has a maximum context size of 270k; that's a quarter of Opus's.
The model is fine, I've switched to it entirely for a personal project, but it's not Opus.
And no, you're not running them locally unless you're a millionaire. You still need hundreds of GB (500+) of VRAM, and that's not at the level of consumer electronics.
Sure, you can run the quantized models, but then you're at Haiku performance.
Whilst I agree with the premise, I think you are actually underselling them.
Claude becomes near-lobotomized beyond 500,000 tokens. I don't believe much quality code gets output at such high token counts, not to mention the drastically increased cost.
270k isn't massive, but it's very usable with compaction. Not every task needs the full context history.
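As a sketch of what compaction means in practice (a minimal version; the token count is a crude estimate and the summarizer is whatever you plug in, usually the model itself):

    CONTEXT_BUDGET = 270_000   # K2.6's advertised window, per the comment above
    KEEP_RECENT = 20           # always keep the most recent turns verbatim

    def count_tokens(history):
        # crude stand-in: real harnesses use the model's own tokenizer
        return sum(len(m["content"]) // 4 for m in history)

    def compact(history, summarize):
        """Replace older turns with a summary once the window gets tight."""
        if count_tokens(history) < CONTEXT_BUDGET * 0.8:
            return history
        old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
        return [{"role": "system", "content": summarize(old)}] + recent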
Quantized models do have a quality / accuracy impact, although it is not as drastic as you suggest. There is some good data on this [0].
"These findings confirm that quantization offers large benefits in terms of cost, energy, and performance without sacrificing the integrity of the models. "
One thing worth mentioning is that quantized models are not created equal; they don't all scale at the same rate [1]. For example, not all tensors contribute equally to model accuracy. In practice, the most sensitive parts (such as key attention projections) are often quantized less aggressively to preserve the quality of the inference (a conceptual sketch follows the links below).
[0] - https://developers.redhat.com/articles/2024/10/17/we-ran-ove...
[1] - https://medium.com/@paul.ilvez/demystifying-llm-quantization...
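That per-tensor idea, as a purely illustrative sketch (the name patterns and bit widths here are made up, not taken from any particular quantization scheme):

    def bits_for(tensor_name: str) -> int:
        """Pick a quantization width per tensor; sensitive tensors stay at higher precision."""
        if "k_proj" in tensor_name or "v_proj" in tensor_name:
            return 8    # key attention projections: quantized less aggressively
        if "embed" in tensor_name or "lm_head" in tensor_name:
            return 8    # embeddings and the output head are also usually kept higher precision
        return 4        # the bulk of the weights can drop to 4-bit with modest quality loss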
Qwen 3.6 runs on a single GPU. But I mostly agree with you, except that just because a model has a given context doesn't mean it's all available or entirely reliable.
You can run the big models in RAM, including via offloading weights from disk. They will be extremely slow on ordinary hardware, but they will run. Hundreds of gigabytes of RAM is a viable purchase for many, and the footprint can be split over multiple nodes with pipeline parallelism. If that's still too slow for the total throughput you expect to need on an ongoing 24/7 basis, that's when it becomes sensible to think about adding discrete GPUs for acceleration.
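For example, with Hugging Face transformers plus accelerate you can let the loader spill weights to CPU RAM and then to disk (the model id here is a placeholder); it will run, just slowly:

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "some-org/huge-model",      # placeholder id
        device_map="auto",          # fill GPU first, then CPU RAM, then disk
        offload_folder="offload",   # weights that fit nowhere else get memory-mapped from here
        torch_dtype=torch.float16,
    )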
Yes, multiple GPUs absolutely help with inference, even for a single model instance. Some models are simply too big to fit on the largest available GPU.
Check out tensor parallelism
Tensor parallelism is not useful on consumer platforms with slow interconnects, unless compute is really low and you prioritize decreasing latency over throughput. Pipeline parallelism (and potentially expert parallelism) is more workable.
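For a concrete comparison, in vLLM the two are just different engine arguments (the model id is a placeholder): tensor parallelism shards every layer and hammers the interconnect, while pipeline parallelism only passes activations between stages.

    from vllm import LLM

    # tensor parallelism: each layer is sharded across 4 GPUs; wants NVLink-class bandwidth
    llm_tp = LLM(model="some-org/huge-model", tensor_parallel_size=4)

    # pipeline parallelism: consecutive layers live on different GPUs; only activations
    # cross between stages, so slow PCIe links hurt far less
    llm_pp = LLM(model="some-org/huge-model", pipeline_parallel_size=4)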