This article definitely feels like chatgptese.
Also, I don't think the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint, it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that strong evidence that the weights are some sort of generalization of the input data rather than a memorization?
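A quick back-of-the-envelope version of that argument (every number below is my own illustrative assumption, roughly the scale of recent open models, not a measurement):

    # Back-of-the-envelope: model weights on disk vs. training data size.
    # All numbers here are assumptions for illustration only.
    params = 70e9          # a hypothetical 70B-parameter model
    bytes_per_param = 2    # fp16 storage
    weights_gb = params * bytes_per_param / 1e9

    tokens = 10e12         # a hypothetical ~10T-token pretraining corpus
    bytes_per_token = 4    # rough average bytes of raw text per token
    data_gb = tokens * bytes_per_token / 1e9

    print(f"weights on disk: ~{weights_gb:,.0f} GB")         # ~140 GB
    print(f"training data:   ~{data_gb:,.0f} GB")            # ~40,000 GB
    print(f"data / weights:  ~{data_gb / weights_gb:,.0f}x") # ~286x

So even with generous assumptions, the weights come out a couple of orders of magnitude smaller than the data they were trained on: whatever they store, it can't be a verbatim copy.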
1) Yes, it's definitely ChatGPT.
2) The weights are definitely a generalization. The compression-based argument is sound.
3) There is definitely no overfitting. The article, however, used the word over-parameterization, which is a different thing, and LLMs certainly are over-parameterized: they have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. That's not a bad thing, though.
Just like having an over-parameterized database schema can sometimes be good for performance, the lottery ticket hypothesis (as ChatGPT explained in TFA) says that over-parameterization can also be good for neural networks. Note that this hypothesis is strictly tied to the fact that we use SGD (or Adam, or ...) as the optimization algorithm: SGD is known to be biased towards compressed solutions that generalize [the lottery ticket hypothesis is one attempt to explain why this is so]. That is to say, it's not an inherent property of the neural network architecture, transformers, or the like. A rough sketch of the train-prune-rewind procedure is below.
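For anyone curious what the lottery ticket procedure looks like mechanically, here's a minimal train-prune-rewind sketch in PyTorch. The toy MLP, the random data, the 80% pruning fraction, and the hyperparameters are all illustrative assumptions on my part, not anything from TFA or the original paper:

    # Lottery-ticket-style sketch: train, prune small-magnitude weights,
    # rewind the survivors to their initial values, retrain under the mask.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    X, y = torch.randn(256, 20), torch.randn(256, 1)   # toy regression data

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    init_state = {k: v.clone() for k, v in model.state_dict().items()}

    def train(model, masks=None, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.mse_loss(model(X), y)
            loss.backward()
            opt.step()
            if masks is not None:                 # keep pruned weights at zero
                with torch.no_grad():
                    for name, p in model.named_parameters():
                        if name in masks:
                            p.mul_(masks[name])
        return loss.item()

    dense_loss = train(model)

    # Prune the 80% smallest-magnitude weights per layer (biases kept).
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            threshold = p.abs().flatten().kthvalue(int(0.8 * p.numel())).values
            masks[name] = (p.abs() > threshold).float()

    # Rewind the surviving weights to initialization: the "winning ticket".
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    ticket_loss = train(model, masks)
    print(f"dense loss: {dense_loss:.4f}, pruned-ticket loss: {ticket_loss:.4f}")

The hypothesis' claim is that the rewound sparse subnetwork often trains to comparable loss, i.e. the over-parameterized dense network mostly served to give SGD lots of candidate subnetworks in which to find a good one.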