On the contrary, current AI training techniques require gigantic amounts of data to do anything useful, and there is no upper limit whatsoever - the more relevant data you have to train on, the better your model will be, period.

In fact, the biggest thing making it unlikely that LLM scaling will continue is that current LLMs have already been trained on virtually every piece of human text we have access to today. So without large amounts of new training data, the only way they'll keep scaling is through new discoveries in how to train more efficiently - and there is no way to put a predictable timeline on that.
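
For a rough sense of why the text supply runs out, here's a back-of-envelope sketch using the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter. All the constants (tokens per parameter, words per token, words per book) are illustrative assumptions, not measurements:

    # Back-of-envelope: how much text a compute-optimal training run wants,
    # using the ~20 tokens-per-parameter heuristic from the Chinchilla work.
    # Every constant here is an illustrative assumption.

    TOKENS_PER_PARAM = 20        # Chinchilla-style rule of thumb
    WORDS_PER_TOKEN = 0.75       # rough average for English text
    WORDS_PER_BOOK = 90_000      # assumed length of a typical book

    for params in (7e9, 70e9, 1e12):
        tokens = params * TOKENS_PER_PARAM
        books = tokens * WORDS_PER_TOKEN / WORDS_PER_BOOK
        print(f"{params / 1e9:6.0f}B params -> {tokens / 1e12:5.2f}T tokens "
              f"(~{books / 1e6:.0f}M books of text)")

Even at the 70B scale that works out to something on the order of ten million books' worth of text, which is why model builders leaned so heavily on web scrapes in the first place.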

Ironically, scaling limits and the evidence that data quality vastly outweighs quantity suggest that all that web data is much less useful than buying and scanning books. Most work with the Common Crawl data, for example, has ended up focusing on filtering out vast amounts of it as mostly useless for training purposes.
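
To make that concrete, here's a minimal sketch of the kind of heuristic filtering that gets applied to Common Crawl-style web text. The specific rules and thresholds are made up for illustration; real pipelines use many more rules plus model-based classifiers and deduplication:

    # Minimal sketch of heuristic quality filtering for web-scraped text.
    # Thresholds are illustrative assumptions, not values from any real pipeline.

    def looks_like_training_text(doc: str) -> bool:
        words = doc.split()
        if len(words) < 50 or len(words) > 100_000:    # too short / suspiciously long
            return False
        mean_word_len = sum(len(w) for w in words) / len(words)
        if not (3 <= mean_word_len <= 10):             # gibberish or run-on tokens
            return False
        lines = doc.splitlines()
        if sum(l.strip().endswith("...") for l in lines) > 0.3 * len(lines):
            return False                               # mostly truncated boilerplate
        if doc.count("{") + doc.count("}") > 0.05 * len(words):
            return False                               # likely leftover markup soup
        return True

    docs = ["some scraped page text ...", "..."]       # placeholder inputs
    kept = [d for d in docs if looks_like_training_text(d)]
    print(f"kept {len(kept)} of {len(docs)} documents")

Stack enough rules like these and you end up discarding most of the raw crawl, which is exactly the point: the usable fraction is far smaller than the headline dataset size.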

There was a hot minute in 2023 when it looked like we could just scale data and compute to the moon. Shockingly, it turns out there are limits to that approach.