Because reasoning is an emergent byproduct of training the model on as much knowledge as possible. It still doesn't "know" things in that sense; it just generates tokens, no matter how we spin it.

So if you don't train it on a large dataset with many words and many sensible connections between them, it won't be able to reason, because it won't be able to make proper connections between words and sentences.

You can see this for yourself: train a really small model on only a small dataset and look at the gibberish it outputs.
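If you want a feel for this without a GPU, here is a toy sketch. To be clear, it is a character-level bigram model (each character picked only from what followed the previous one in the training text), not a neural network, so it is only a stand-in for "tiny model, tiny dataset". The corpus, function names, and seed are all made up for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def train_bigram(text: str) -> Dict[str, List[str]]:
    """For each character, record every character that followed it."""
    followers = defaultdict(list)
    for current, nxt in zip(text, text[1:]):
        followers[current].append(nxt)
    return dict(followers)


def generate(model: Dict[str, List[str]], start: str, length: int, seed: int = 0) -> str:
    """Sample one character at a time, looking only at the previous character."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = model.get(out[-1])
        if not options:  # dead end: this character never had a follower
            break
        out.append(rng.choice(options))
    return "".join(out)


# Hypothetical two-sentence "dataset", far too small to learn anything useful.
tiny_corpus = "the cat sat on the mat. the dog sat on the log."
model = train_bigram(tiny_corpus)
sample = generate(model, "t", 60)
print(sample)
```

The output only ever uses character pairs it has seen, so it looks vaguely word-like but carries no meaning. More data and a real model are what turn that into something coherent.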

Min-maxing the dataset to get the most capability out of the least data does sound like fun, but if you want to build SoTA models as a company, the economic tradeoff of doing that vs. slapping a few more GPUs together is terrible.

I think small expert models from open-weight providers could be pretty powerful.

Imagine, for example, a model that's primarily trained on TypeScript and general programming. It would be faster to train and could be a lot smaller than a generalist model. It might be the best model to pick when you are writing TypeScript. And if you could squeeze that into 3B parameters, a lot of consumer hardware could run it locally.
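For a rough sense of why 3B is a nice target, here is a back-of-the-envelope calculation of how much memory the weights alone would take at a few common precisions. The function name is just for illustration, and the precisions are my own picks rather than anything a specific model ships with.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30


# 16-bit (fp16/bf16), 8-bit and 4-bit quantized weights for a 3B-parameter model.
for bits in (16, 8, 4):
    print(f"3B params at {bits}-bit: ~{weight_memory_gib(3e9, bits):.1f} GiB")
```

That works out to roughly 5.6 GiB at 16-bit and about 1.4 GiB at 4-bit. It only counts the weights; the KV cache and activations add more on top, depending on context length.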

You could even expand it to just "webdev tech" or the like.