People believed that more parameters would lead to overfitting instead of generalization. The various regularization methods we use today to avoid overfitting hadn't been discovered yet. Your statement is most likely about this.
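For a concrete picture of what those later methods look like, here is a minimal sketch (assuming PyTorch; the layer sizes and hyperparameters are just placeholders) of two of them, dropout and L2 weight decay:

```python
import torch
import torch.nn as nn

# Two regularization techniques that post-date early neural-net textbooks.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # dropout: randomly zeroes activations during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the weights to the update rule
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```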
I think the problems with big networks were vanishing gradients, which is why we now use the ReLU activation function, and training instability, which was addressed with residual connections.
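Roughly what that combination looks like in code, as a minimal sketch (assuming PyTorch; not any particular paper's architecture): ReLU avoids the saturating regions of sigmoid/tanh that shrink gradients, and the skip connection gives gradients an identity path around the block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is x + F(x); the "+ x" skip path lets gradients
        # flow through unchanged even if F's gradients are small.
        return self.relu(x + self.fc2(self.relu(self.fc1(x))))

x = torch.randn(8, 64)
y = ResidualBlock(64)(x)   # shape preserved: (8, 64)
```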
Overfitting is the problem of having too little training data for your network size: the model memorizes the training examples instead of generalizing.
Possibly; I'd have to dig up the book to check. IIRC it didn't mention overfitting, but it was a long time ago.