RNNs worked that way too; the difference is that Transformers are parallelizable across the sequence, which is what made next-word prediction work so well: you could feed in an input thousands of tokens long without your training taking thousands of times longer.
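To make the contrast concrete, here's a minimal PyTorch sketch (the layer names and sizes are illustrative assumptions, not from the original comment): the RNN has to step through the sequence one token at a time, each step depending on the previous one, while self-attention handles every position in a single batched matrix multiply.

```python
import torch
import torch.nn as nn

seq_len, d_model = 1000, 64
x = torch.randn(1, seq_len, d_model)  # batch of 1, 1000 tokens

# RNN: the hidden state is computed token by token -- a sequential
# loop of length seq_len that cannot be parallelized across positions.
rnn_cell = nn.RNNCell(d_model, d_model)
h = torch.zeros(1, d_model)
for t in range(seq_len):           # 1000 dependent steps
    h = rnn_cell(x[:, t, :], h)

# Transformer self-attention: every position attends to every other
# in one batched operation -- one parallel step instead of 1000.
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out, _ = attn(x, x, x)             # all positions computed at once
```

The loop is the whole story: during training, the RNN's 1000 steps must run in order, while the attention call maps onto a few large matrix multiplies that a GPU executes in parallel.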