I think it doesn't have to follow. You could also generalize the idea and see learning as successfully being able to "use the past to predict the future" for small time increments. Next-word prediction would be one instance of this, but for humans and animals, you could imagine the same process with information from all senses. The "self-supervised" trainset is then just, well, life.