I think there's been a lot of progress on efficient, useful models recently.
I've seen GLM-4.6 getting mentions for good coding results from a model (~350B params) that's much smaller than Kimi, and I've seen speculation that Windsurf based their new model on it.
This Kimi release is natively INT4, with quantization-aware training. If that works (if you can really get strong results out of four-bit weights), it seems like a genuinely useful tool for any model creator wanting efficient inference.
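For anyone unfamiliar with QAT: the usual trick is to fake-quantize weights in the forward pass during training so the model learns to tolerate the rounding, while gradients flow through unchanged (a straight-through estimator). Here's a minimal PyTorch sketch of that idea; the per-channel scaling and the names are my assumptions for illustration, not Kimi's actual recipe:

```python
import torch

def fake_quant_int4(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize a weight tensor to signed INT4 (integer grid [-8, 7]).

    Forward pass sees the quantized values; the straight-through
    estimator makes the backward pass treat quantization as identity.
    """
    # Per-output-channel scale so each row uses the full INT4 range.
    # (Assumed scheme for illustration; real recipes vary.)
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (w / scale).round().clamp(-8, 7)  # snap to the integer grid
    w_q = q * scale                       # dequantize back to float
    # Straight-through: forward value is w_q, gradient flows to w.
    return w + (w_q - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized while training."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(
            x, fake_quant_int4(self.weight), self.bias
        )
```

The payoff is that at deployment time you can store and serve the real INT4 integers directly, since the model was trained against exactly that rounding error rather than having it imposed after the fact.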
DeepSeek's v3.2-Exp uses their sparse attention technique (DSA) to make longer-context training and inference more efficient. Its output is priced at 60% less than v3.1's (an imperfect indicator of efficiency, but suggestive). They've also quietly made 'thinking' mode need fewer tokens than it did in R1, which helps both cost and latency.
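The core intuition behind token-sparse attention is that each query attends to only a small selected subset of keys instead of the whole context, so cost stops scaling with the full sequence length. Here's a toy top-k illustration in PyTorch; it is not DeepSeek's implementation (they use a cheap learned indexer to pick tokens, plus custom kernels, and this dense version still materializes the full score matrix), just the shape of the idea:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, top_k: int = 64):
    """Each query attends only to its top_k highest-scoring keys.

    q, k, v: [batch, seq, dim]. Causal masking is omitted for brevity.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # [B, S, S]
    top_k = min(top_k, scores.shape[-1])
    # k-th largest score per query row; anything below it gets masked out.
    kth = scores.topk(top_k, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With a fixed top_k, the attention each query actually computes no longer grows with context length, which is where the long-context savings come from.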
And though it's on the proprietary side, Haiku 4.5 approaching Sonnet 4's coding capability (at least on the benchmarks Anthropic released) also suggests legitimately useful models can be much smaller than the big ones.
There's not yet a model at the level of any of the above that's practical for most people to run locally, though I think "efficient to run + open, so competing inference providers can serve it" is real progress.
More importantly, there seems to be a good trendline toward efficiency, and a bunch of techniques are being researched and tested that, used together, could make for efficient, higher-quality models.