The idea that simply having a lot of parameters leads to overfitting was shown not to hold over 30 years ago by Vapnik and Chervonenkis. Vapnik showed that a large number of parameters is fine so long as you control the model's capacity, i.e. regularize enough. This is why Support Vector Machines work, and I believe it has a lot to do with why deep NNs work.
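For reference, here is a commonly quoted simplified form of Vapnik's generalization bound (a standard result from VC theory, not something stated in the original comment; constants vary by statement). The point is that the complexity penalty is driven by the VC dimension h of the hypothesis class, not by the raw parameter count.

```latex
% With probability at least 1 - \delta over a sample of size n,
% simultaneously for every classifier f in a class of VC dimension h:
\[
  R(f) \;\le\; R_{\mathrm{emp}}(f)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}}
\]
% R(f): true risk, R_emp(f): empirical (training) risk.
% The penalty term depends on h, not on how many parameters f has.
```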
The issue with Vapnik's work is that it's dense, and actually figuring out the Vapnik-Chervonenkis (VC) dimension etc. is complicated. Once you understand the material you can develop pretty good intuition without doing the calculation, so most people don't take the time to do it. And frankly, a lot of the time, you don't need to.
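To give a flavour of the kind of calculation involved (these are standard results from the VC literature, not claims made in the comment above): for plain hyperplanes the VC dimension tracks the parameter count, but restricting to large-margin separators, as an SVM does, bounds the capacity independently of the dimension.

```latex
% Hyperplanes in R^d:  h = d + 1  (capacity grows with parameter count).
% \gamma-margin hyperplanes on data inside a ball of radius R (Vapnik):
\[
  h \;\le\; \min\!\left(\left\lceil \frac{R^{2}}{\gamma^{2}} \right\rceil,\; d\right) + 1
\]
% So a large-margin classifier in a very high- (even infinite-) dimensional
% feature space can still have small effective capacity.
```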
There may be something I'm missing completely, but to me the fact that models continue to generalize with a huge number of parameters is not all that surprising given how much we regularize when we fit NNs. A lot of the surprise comes from the fact that people in mathematical statistics and people who do neural networks (computer scientists) don't talk to each other as much as they should.
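As a concrete illustration of the kind of regularization meant here, below is a minimal PyTorch sketch of the usual explicit knobs (weight decay, dropout, early stopping). The model, hyperparameters, and synthetic data are placeholders for illustration, not anything from the original comment.

```python
# Minimal sketch: an over-parameterized MLP fit with common explicit regularizers.
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 100), torch.randint(0, 2, (512,))
X_val,   y_val   = torch.randn(128, 100), torch.randint(0, 2, (128,))

model = nn.Sequential(                  # far more parameters than data points
    nn.Linear(100, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 2),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # L2 penalty
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:          # early stopping: another cap on effective capacity
        break
```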
Strongly recommend the book Statistical Learning Theory by Vapnik for more on this.