You want more research on small language models? You're confused. There is already WAY more research done on small language models (SLMs) than big ones. Why? Because it's easy: it only takes a moderate workstation to train an SLM, so every curious master's student and motivated undergrad is doing it. Lots of PhD research is done on SLMs because the hardware to train big models is stupidly expensive, even for many well-funded research labs. If you read arXiv papers (not just the flashy ones published by companies with PR budgets), most of the research is done on 7B-parameter models. Heck, some NeurIPS papers (an extremely competitive, prestigious venue) from _this year_ are being done on 1.5B-parameter models.
Lack of research is not the problem; the problem is fundamental limitations of the technology. I'm not going to say "there's only so much smarts you can cram into a 7B-parameter model", because we don't know that for sure yet. But we do know, without a sliver of a doubt, that it's VASTLY EASIER to cram smarts into a 70B-parameter model than into a 7B-parameter one.
It's not clear whether the ultimate SLMs will come from teams with fewer computing resources building them directly, or from teams with more resources performing ablation studies etc. on larger models to see what can be removed.
I wouldn't care to guess what the limit is, but Karpathy suggested in his Dwarkesh interview that AGI could perhaps be a 1B-parameter model if reasoning is separated (to the extent possible) from knowledge, which can be stored externally.
I'm really more interested in coding models specifically rather than general-purpose ones, where it does seem that a HUGE part of the training data for a frontier model has no applicability.
That’s backwards. New research and ideas are proven on small models. Lots and lots of ideas are tested that way. Good ideas get scaled up to show they still work on medium-sized models. The very best ideas make their way into the code for the next huge training runs, which can cost tens or hundreds of millions of dollars.
Not to nitpick words, but ablation is the practice of stripping out features of an algorithm or technique to see which parts matter and how much. This is standard (good) practice on any innovation, regardless of size.
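To make that concrete, here's a toy sketch of what an ablation study looks like in practice: remove one component at a time, re-evaluate, and see how much the score drops. Every name and number below is made up purely for illustration, not taken from any real model.

```python
# Toy ablation study: remove one component at a time and measure the score drop.
# All component names and score contributions are hypothetical.

def evaluate(components):
    # Hypothetical contribution of each component to a benchmark score,
    # on top of a 0.40 baseline.
    contribution = {"attention": 0.30, "mlp": 0.25, "data_dedup": 0.10, "rlhf": 0.05}
    return 0.40 + sum(contribution[c] for c in components)

full = {"attention", "mlp", "data_dedup", "rlhf"}

# Score drop when each component is ablated (removed) individually.
ablation_deltas = {
    c: round(evaluate(full) - evaluate(full - {c}), 2)
    for c in full
}
# Big deltas mark components that matter; near-zero deltas mark removal candidates.
```

Real ablations work the same way, just with a training run and a benchmark suite in place of the lookup table.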
Distillation is taking power / capability / knowledge from a big model and trying to preserve it in something smaller. This also happens all the time, and we see very clearly that small models aren’t as clever as big ones. Small models distilled from big ones might be somewhat smarter than small models trained on their own. But not much. Mostly people like distillation because it’s easier than carefully optimizing the training for a small model. And you’ll never break new ground on absolute capabilities this way.
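For readers unfamiliar with the mechanics: the standard trick (following Hinton et al.'s knowledge-distillation setup) is to train the small model to match the big model's temperature-softened output distribution, typically via a KL-divergence loss. A minimal NumPy sketch, with made-up logits:

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-softened softmax; higher T spreads probability mass out.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) over the batch, on softened distributions."""
    p = softmax(teacher_logits, T)  # teacher's soft targets
    q = softmax(student_logits, T)  # student's predictions
    return float(np.sum(p * (np.log(p) - np.log(q))) / len(p))

# A student that matches the teacher exactly incurs zero loss;
# a diverging student incurs a positive loss it would be trained to shrink.
matched = distillation_loss(np.array([[2.0, 0.5, -1.0]]),
                            np.array([[2.0, 0.5, -1.0]]))
diverged = distillation_loss(np.array([[0.0, 0.0, 2.0]]),
                             np.array([[2.0, 0.5, -1.0]]))
```

In a real pipeline this loss (usually mixed with the ordinary cross-entropy on hard labels) is what gets backpropagated through the student.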
> Not to nitpick words, but ablation is the practice of stripping out features of an algorithm ...
Ablation generally refers to removing parts of a system to see how it performs without them. In the context of an LLM it can refer to training data as well as the model itself. I'm not saying it'd be the most cost-effective method, but one could certainly try to create a small coding model by starting with a large one that performs well, and seeing what can be stripped out of the training data (obviously a lot!) without impacting the performance.
ML researchers will sometimes vary the size of the training data set to see what happens. It’s not common - except in scaling law research. But it’s never called “ablation”.