With these things it's always both at the same time: these super grandiose SOTA models are making most of their improvements through optimizations, and they're scaling out as far as they can.

In turn, these new techniques make much more possible with smaller models. It takes time, but smaller models really can do a lot more now. DeepSeek was a good example: a large model whose innovations in how it used transformers carried a lot of benefits over to smaller models.

Also: keep in mind that this particular model is actually a MoE model that only activates 32B parameters at a time. So in a sense it really is a whole bunch of smaller expert networks bundled into a single large model, with a router picking which ones run for each token.
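
For intuition, here's a minimal sketch of top-k MoE routing in PyTorch. Everything here (the `MoELayer` name, `d_model=64`, `n_experts=8`, `top_k=2`) is made up for illustration and isn't this model's actual code; it just shows why "active" parameters are a fraction of the total:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (toy dimensions)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts)
        # Each "expert" is just a small feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest stay idle,
        # so the parameters "activated" per token are a small slice of the total.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(MoELayer()(tokens).shape)  # torch.Size([4, 64])
```

With 8 experts and top-2 routing, each token touches roughly a quarter of the expert parameters per layer, which is the same basic mechanism that lets a huge MoE model activate only 32B parameters at a time.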