It's a lot cheaper to train a 27B model such as Qwen3.6 (one you can even vibe code or do agentic coding with) than it is to train a 1T+ parameter model. It runs on a single commodity GPU, for goodness' sake.
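For a rough sense of why a ~27B model fits on one card, here's a back-of-envelope weight-memory estimate (just a sketch; the real footprint also depends on KV cache, context length, and runtime overhead):

    # Approximate weight memory for a ~27B-parameter model at common precisions.
    PARAMS = 27e9  # ~27 billion parameters (assumed round number)

    BYTES_PER_PARAM = {
        "fp16/bf16": 2.0,
        "int8": 1.0,
        "int4": 0.5,
    }

    for precision, bytes_per_param in BYTES_PER_PARAM.items():
        gib = PARAMS * bytes_per_param / 1024**3
        print(f"{precision:>10}: ~{gib:.0f} GiB of weights")

    # Output, roughly:
    #   fp16/bf16: ~50 GiB  -> multi-GPU or a data-center card
    #        int8: ~25 GiB  -> fits a single 32 GB card
    #        int4: ~13 GiB  -> fits a 16-24 GB consumer GPU

So with 4-bit quantization the weights alone land comfortably inside a single consumer GPU, which is what makes these models so accessible.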

It's not just a theory: these smaller models coming out are huge advances for the field.

I can't comment on companies' training practices; that's proprietary stuff, I guess. But I think the claims that these advances are due to distillation alone are completely unfair. The advances aren't just about data.