I think the main problems with big networks were vanishing gradients, which is why we now use the ReLU activation function, and training instability, which was largely addressed by residual connections.
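To make both ideas concrete, here is a minimal NumPy sketch (the function names and layer sizes are illustrative, not from any particular architecture): ReLU has gradient exactly 1 for positive inputs, so stacking layers doesn't shrink the signal the way saturating activations can, and a residual connection adds the input back onto the transformed output, giving gradients an identity path through the block.

```python
import numpy as np

def relu(x):
    # ReLU: identity for positive inputs (gradient 1), zero otherwise.
    # Unlike sigmoid/tanh, it does not saturate for large positive
    # values, which helps with vanishing gradients in deep stacks.
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # Residual connection: y = x + f(x). The "+ x" skip path lets
    # gradients flow through the block unchanged, stabilizing
    # training of very deep networks.
    return x + relu(x @ W1) @ W2

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1

y = residual_block(x, W1, W2)
print(y.shape)  # output keeps the input's shape, as the skip path requires
```

Note that the skip connection forces the block's input and output shapes to match, which is why real residual architectures add projection layers when dimensions change.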
Overfitting is when a model fits its training data too closely, memorizing noise rather than the underlying pattern; it tends to happen when you have too little training data for your network's capacity.
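A small, hypothetical demonstration of this with polynomial fits (standing in for networks of different capacities): a high-degree polynomial has enough parameters to pass through a handful of noisy points almost exactly, driving training error toward zero while tracking the noise instead of the trend.

```python
import numpy as np

# Ten noisy samples of a sine curve: a small training set.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)

def train_error(degree):
    # Fit a polynomial of the given degree (a proxy for model
    # capacity) and return its mean squared error on the
    # training points themselves.
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))

# The degree-9 fit has far lower training error than the
# degree-1 fit, but it is memorizing the noise in these ten
# points, not the sine curve underneath: classic overfitting.
print(train_error(1), train_error(9))
```

Low training error alone says nothing here; only error on held-out data would reveal that the high-capacity fit generalizes worse.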