Hacker News

Please don't oversell them. E.g., Kimi K2.6 has a maximum context size of 270k; that's a quarter of Opus's.

The model is fine, I've switched to it entirely for a personal project, but it's not Opus.

And no, you're not running them locally unless you're a millionaire. You still need hundreds of GB (500+) of VRAM on your graphics card; that's not consumer-electronics territory.

Sure, you can run the quantized models, but then you're at Haiku performance.
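
For a rough sense of the memory numbers above, here is a back-of-envelope sketch. The 1-trillion parameter count is an illustrative assumption, not a figure from this thread, and the estimate covers weights only (no KV cache or activations).

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
N_PARAMS = 1e12

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_footprint_gb(N_PARAMS, bits):,.0f} GB")
```

Under that assumption, even 4-bit weights come to roughly 500 GB, which is why quantization alone does not bring a model of this size into consumer GPU range.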

HDBaseT 13 hours ago

Whilst I agree with the premise, I think you are actually underselling them.

Claude becomes near-lobotomized beyond 500,000 tokens. I don't believe much quality code gets produced at such high token counts, not to mention the drastically increased cost.

270k isn't massive, but it's very usable with compaction. Not every task needs the full context history.
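
To make "compaction" concrete, here is a minimal sketch of one way it could work: keep the newest messages that fit a token budget and fold everything older into a single summary. The whitespace token counter and the `summarize` callback are stand-in assumptions; real agents use the model's own tokenizer and an LLM-generated summary.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: real tokenizers count differently.
    return len(text.split())


def compact(messages: list[str], budget: int, summarize) -> list[str]:
    """Keep the newest messages that fit in `budget`; summarize the rest."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):          # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()                          # restore chronological order
    older = messages[: len(messages) - len(kept)]
    if older:
        # The summary's own tokens are not charged against `budget` here.
        kept.insert(0, summarize(older))
    return kept


history = ["a b c", "d e", "f"]
print(compact(history, budget=3, summarize=lambda old: f"[summary of {len(old)} messages]"))
```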

Quantized models do have a quality / accuracy impact, although it is not as drastic as you suggest. There is some good data on this [0].

"These findings confirm that quantization offers large benefits in terms of cost, energy, and performance without sacrificing the integrity of the models. "

One thing worth mentioning is that quantized models are not created equal; they don't all scale at the same rate [1]. For example, not all tensors contribute equally to model accuracy. In practice, the most sensitive parts (such as key attention projections) are often quantized less aggressively to preserve inference quality.
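
As a toy illustration of that point, the sketch below round-trips a few weights through symmetric uniform quantization at different bit widths, then computes the average bits per weight for a mixed-precision plan. The tensor names, sizes, and bit assignments are hypothetical, and real quantization schemes (block-wise scales, k-quants, and so on) are more involved.

```python
def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    levels = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / levels or 1.0
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


def average_bits(sizes, plan):
    """Average bits per weight when each tensor gets its own bit width."""
    total = sum(sizes.values())
    return sum(sizes[name] * plan[name] for name in sizes) / total


sample = [0.9, -0.35, 0.12, 0.61, -0.77]
print("4-bit error:", mean_abs_error(sample, 4))
print("8-bit error:", mean_abs_error(sample, 8))

# Hypothetical plan: keep the sensitive tensor at 8 bits, the bulk at 4.
sizes = {"attn.k_proj": 1_000, "mlp.up_proj": 3_000}
plan = {"attn.k_proj": 8, "mlp.up_proj": 4}
print("average bits per weight:", average_bits(sizes, plan))
```

The average stays close to the low bit width because the tensors kept at higher precision are a small share of the total.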

[0] - https://developers.redhat.com/articles/2024/10/17/we-ran-ove...

[1] - https://medium.com/@paul.ilvez/demystifying-llm-quantization...

2ndorderthought a day ago

Qwen 3.6 runs on a single GPU. But I mostly agree with you, except that just because a model has a given context size doesn't mean all of it is available or entirely reliable.

zozbot234 17 hours ago

You can run the big models in RAM, including via offloading weights from disk. They will be extremely slow on ordinary hardware, but they will run. Hundreds of gigabytes of RAM is a viable purchase for many, and the footprint can be split over multiple nodes with pipeline parallelism. If that's still too slow for the total throughput you expect to need on an ongoing 24/7 basis, that's when it becomes sensible to think about adding discrete GPUs for acceleration.
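
A minimal sketch of the pipeline-parallel split described above: each node holds a contiguous slice of layers, so the memory footprint is divided roughly evenly. The layer and node counts are made up for illustration, and a real deployment would also weigh per-node RAM and network latency.

```python
def split_layers(n_layers: int, n_nodes: int) -> list[range]:
    """Assign contiguous layer ranges to nodes, as evenly as possible."""
    base, extra = divmod(n_layers, n_nodes)
    stages, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)   # first `extra` nodes get one more
        stages.append(range(start, start + size))
        start += size
    return stages


# Hypothetical setup: 10 layers over 3 nodes.
for node, layers in enumerate(split_layers(10, 3)):
    print(f"node {node}: layers {layers.start}..{layers.stop - 1}")
```

Activations flow from one node to the next in order, so each node only needs the weights for its own slice.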