RNNs worked that way too; the difference is that Transformers are parallelizable across the sequence, which is what made next-word prediction work so well: you could feed in an input thousands of tokens long without your training taking thousands of times longer.
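To make the contrast concrete, here's a minimal PyTorch sketch (the layer names and sizes are illustrative assumptions, not from the original comment): the RNN has to step through the sequence one token at a time, each step depending on the previous one, while self-attention handles every position in a single batched matrix multiply.

```python
import torch
import torch.nn as nn

seq_len, d_model = 1000, 64
x = torch.randn(1, seq_len, d_model)  # batch of 1, 1000 tokens

# RNN: the hidden state is computed token by token -- a sequential
# loop of length seq_len that cannot be parallelized across positions.
rnn_cell = nn.RNNCell(d_model, d_model)
h = torch.zeros(1, d_model)
for t in range(seq_len):           # 1000 dependent steps
    h = rnn_cell(x[:, t, :], h)

# Transformer self-attention: every position attends to every other
# in one batched operation -- one parallel step instead of 1000.
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x)             # all positions computed at once
```

The loop is the whole story: during training, the RNN's 1000 steps must run in order, while the attention call maps onto a few large matrix multiplies that a GPU executes in parallel.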