This article definitely feels like chatgptese.
Also, I don't think the size of LLMs even comes close to overfitting the data. From a very unscientific standpoint, it seems like the size of the weights on disk would have to meet or exceed the size of the training data (modulo lossless compression) for overfitting to occur. Since the training data is multiple orders of magnitude larger than the resulting weights, isn't that strong evidence that the weights are some sort of generalization of the input data rather than a memorization?
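A quick back-of-the-envelope version of that argument (every number below is my own illustrative assumption, roughly the scale of recent open models, not a measurement):

    # Back-of-the-envelope: model weights on disk vs. training data size.
    # All numbers here are assumptions for illustration only.
    params = 70e9          # a hypothetical 70B-parameter model
    bytes_per_param = 2    # fp16 storage
    weights_gb = params * bytes_per_param / 1e9

    tokens = 10e12         # a hypothetical ~10T-token pretraining corpus
    bytes_per_token = 4    # rough average bytes of raw text per token
    data_gb = tokens * bytes_per_token / 1e9

    print(f"weights on disk: ~{weights_gb:,.0f} GB")         # ~140 GB
    print(f"training data:   ~{data_gb:,.0f} GB")            # ~40,000 GB
    print(f"data / weights:  ~{data_gb / weights_gb:,.0f}x") # ~286x

So even with generous assumptions, the weights come out a couple of orders of magnitude smaller than the data they were trained on: whatever they store, it can't be a verbatim copy.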
1) Yes, it's definitely ChatGPT.
2) The weights are definitely a generalization. The compression-based argument is sound.
3) There is definitely no overfitting. The article, however, used the word over-parameterization, which is a different thing, and LLMs certainly are over-parameterized: they have more parameters than strictly required to represent the dataset in a degrees-of-freedom statistical sense. That's not a bad thing, though.
Just like having an over-parameterized database schema can sometimes be good for performance, the lottery ticket hypothesis (as ChatGPT explained in TFA) says that over-parameterization can also be good for neural networks. Note that this hypothesis is strictly tied to the fact that we use SGD (or Adam, or ...) as the optimization algorithm: SGD is known to be biased towards compressed solutions that generalize [the lottery ticket hypothesis is one attempt to explain why this is so]. That is to say, it's not an inherent property of the neural network architecture, transformers, or the like. A rough sketch of the train-prune-rewind procedure is below.
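For anyone curious what the lottery ticket procedure looks like mechanically, here's a minimal train-prune-rewind sketch in PyTorch. The toy MLP, the random data, the 80% pruning fraction, and the hyperparameters are all illustrative assumptions on my part, not anything from TFA or the original paper:

    # Lottery-ticket-style sketch: train, prune small-magnitude weights,
    # rewind the survivors to their initial values, retrain under the mask.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    X, y = torch.randn(256, 20), torch.randn(256, 1)   # toy regression data

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
    init_state = {k: v.clone() for k, v in model.state_dict().items()}

    def train(model, masks=None, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.mse_loss(model(X), y)
            loss.backward()
            opt.step()
            if masks is not None:                 # keep pruned weights at zero
                with torch.no_grad():
                    for name, p in model.named_parameters():
                        if name in masks:
                            p.mul_(masks[name])
        return loss.item()

    dense_loss = train(model)

    # Prune the 80% smallest-magnitude weights per layer (biases kept).
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            threshold = p.abs().flatten().kthvalue(int(0.8 * p.numel())).values
            masks[name] = (p.abs() > threshold).float()

    # Rewind the surviving weights to initialization: the "winning ticket".
    model.load_state_dict(init_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

    ticket_loss = train(model, masks)
    print(f"dense loss: {dense_loss:.4f}, pruned-ticket loss: {ticket_loss:.4f}")

The hypothesis' claim is that the rewound sparse subnetwork often trains to comparable loss, i.e. the over-parameterized dense network mostly served to give SGD lots of candidate subnetworks in which to find a good one.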