To crack NLP we needed a large dataset of labeled language examples. Before next-word prediction, the dominant benchmarks and datasets were tasks like translating English sentences to German, and they offered on the order of millions of labeled examples. Next-word prediction turned the entire Internet into labeled data: raw text labels itself, because at every position the preceding words are the input and the word that actually comes next is the label.
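As an illustrative sketch (the function name and framing here are my own, not from any particular library), this is how any span of raw text can be sliced into supervised (context, next-word) pairs with no human annotation:

```python
def next_word_pairs(text):
    """Turn raw text into (context, label) examples for next-word prediction."""
    words = text.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]   # everything seen so far is the input
        label = words[i]      # the word that actually comes next is the label
        pairs.append((context, label))
    return pairs

# A six-word sentence yields five labeled examples, for free.
for context, label in next_word_pairs("the cat sat on the mat"):
    print(context, "->", label)
```

Scale this up from one sentence to the whole Internet and the "millions of labeled examples" ceiling of translation-style datasets disappears.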