But regarding this Lottery Ticket Hypothesis, what it means is that a small percentage of the parameters can be identified such that, when that subset is taken by itself, its weights are reset to their original pre-training (initialization) values, and the resulting sparse network is trained on the same data as the parent, it performs comparably to the parent. So in fact, it seems that far fewer parameters are needed to encode predictions across an Internet-scale dataset. The large model is just creating a space in which that small "rock star" subset of parameters can be automatically discovered. It's as if training establishes a competition among small subnetworks, from which a winner emerges.
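One common way such a subset is identified in practice is iterative magnitude pruning: train, prune the smallest-magnitude weights, reset the survivors to their original initialization, and repeat. Here is a minimal sketch of that loop, assuming a hypothetical `train(model, data, masks)` routine that keeps masked-out weights at zero during training; it is an illustration of the idea, not a faithful reproduction of any particular paper's code.

```python
# Sketch of iterative magnitude pruning to find a "winning ticket".
# Assumptions: `model` is a torch.nn.Module, `train(model, data, masks)` is a
# hypothetical training loop that zeroes pruned weights after each update.
import copy
import torch

def find_winning_ticket(model, data, prune_fraction=0.2, rounds=5):
    init_state = copy.deepcopy(model.state_dict())        # remember the ORIGINAL init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, data, masks)                          # train the masked network
        for name, p in model.named_parameters():
            alive = p[masks[name].bool()].abs()            # weights still in play
            if alive.numel() == 0:
                continue
            threshold = alive.quantile(prune_fraction)     # cut the smallest-magnitude weights
            masks[name] = masks[name] * (p.abs() > threshold).float()
        model.load_state_dict(init_state)                  # reset survivors to pre-training values
        for name, p in model.named_parameters():
            p.data *= masks[name]                          # keep pruned weights at zero

    return masks  # the sparse subnetwork: retrain it from the original init with these masks
```

The key move, and the surprising part of the hypothesis, is the reset step: the surviving weights go back to their original pre-training values before retraining, rather than keeping their trained values.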
Perhaps there is a kind of instability at work: once the winning subset starts to anneal toward predicting the training data, it does more and more of the predictive work. The more relevant the subset proves to the output, the more of the learning it captures, because it becomes more responsive to subsequent updates.