Why does 'next-word prediction' explain why huge models work? You're saying we needed scale, and that we use next-word prediction, but how does one relate to the other? Diffusion models also exist and work well for images, and they seem to work for LLMs too.

I think it's the same underlying principle of learning the "joint distribution of things humans have said". Whether done autoregressively via LLMs or via diffusion models, you still end up learning this distribution. The insight seems to be the crazy leap of recognizing that A) this is a valid thing to talk about at all and B) learning this distribution gives you something meaningful.
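
To make the autoregressive half concrete (my notation, not from the thread): by the chain rule, a model that only ever predicts the next token is still, in aggregate, modeling the full joint distribution, because the joint factorizes exactly into next-token conditionals:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Diffusion models target the same joint distribution, just via a different decomposition (iterative denoising instead of left-to-right conditionals).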

The leap is in transforming the ill-defined objective of "modeling intelligence" into a concrete proxy objective. Note that the task isn't even "the distribution of valid/true things", since validity/truth is hard to define. It's something closer to "the distribution of things a human might say", implemented in the "dumbest" possible way: modeling the distribution of humanity's collective textual output.
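
To pin down how "dumb" that proxy objective is, here's a minimal sketch (toy code of my own; `model_prob` is a hypothetical callable standing in for any model's conditional distribution): the entire objective is just the likelihood the model assigns to what humans actually wrote.

```python
import math

# A minimal sketch, not anyone's actual training loop: the proxy objective is
# maximum likelihood on next-token prediction over a text corpus.
# `model_prob(context, target)` is assumed to return p(next_token | context).

def next_token_nll(model_prob, tokens):
    """Average negative log-likelihood of `tokens` under the model."""
    nll = 0.0
    for t in range(1, len(tokens)):
        context, target = tokens[:t], tokens[t]
        nll -= math.log(model_prob(context, target))
    return nll / (len(tokens) - 1)

# "Training" is just minimizing this, averaged over humanity's collective
# textual output. Nothing about truth or validity appears in the objective.
```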

[deleted]

To crack NLP we needed a large dataset of labeled language examples. Prior to next-word prediction, the dominant benchmarks and datasets were things like English-to-German sentence translation, on the order of millions of labeled examples. Next-word prediction turned the entire Internet into labeled data.
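
To see why this counts as "labeled data", here's a toy sketch (illustrative names, word-level tokenization purely for readability): every position in a raw text stream yields an (input, label) pair with no human annotation.

```python
# Toy sketch: raw text becomes labeled (context -> next word) examples for
# free. `tokenize` here is a stand-in for any real tokenizer.

def tokenize(text):
    return text.split()  # word-level split, purely for illustration

def to_labeled_examples(text):
    tokens = tokenize(text)
    # input = everything seen so far, label = the very next token
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]

print(to_labeled_examples("the cat sat on the mat"))
# [(['the'], 'cat'), (['the', 'cat'], 'sat'), (['the', 'cat', 'sat'], 'on'), ...]
```

Same supervised setup as a translation benchmark, except the labels come for free at Internet scale.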