> why do neural networks work better than other models
The only people for whom this is an open question are the academics - everyone else understands it's entirely because of the bagillions of parameters.
No it isn't, and it's frustrating when the "common wisdom" boils it down to this. If this were true, then models with "infinitely many" parameters would be amazing. Why not just train a gigantic two-layer network? There is a huge amount of work on engineering training procedures that actually work well.
The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
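A neural scaling law of the kind mentioned here is just a power law in model size, L(N) ≈ a·N^(−α), which shows up as a straight line on a log-log plot. A minimal sketch with made-up numbers (the parameter counts, exponent, and losses below are illustrative assumptions, not measurements from any real model family):

```python
import numpy as np

# Hypothetical scaling-law data: test loss falling as a power law in
# parameter count, L(N) = a * N**(-alpha). Values are made up.
N = np.array([1e6, 1e7, 1e8, 1e9])   # assumed parameter counts
alpha_true, a = 0.076, 5.0           # assumed exponent and scale
L = a * N ** (-alpha_true)

# A power law is linear in log-log space, so a straight-line fit
# recovers the exponent from (noise-free) loss measurements.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_est = -slope
print(round(alpha_est, 3))  # 0.076
```

In practice the empirical claim is that measured losses for a model family line up on such a power law over many orders of magnitude of N, which is what lets you extrapolate performance to larger models.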
> The actual reason is due to complex biases that arise from the interaction of network architectures and the optimizers and persist in the regime where data scales proportionally to model size. The multiscale nature of the data induces neural scaling laws that enable better performance than any other class of models can hope to achieve.
That’s a lot of words to say that, if you encode a class of things as numbers, there’s a formula somewhere that can approximate an instance of that class. That works for linear regression and it works just as well for neural networks. The key thing here is approximation.
No, it is relatively few words to quickly touch on several different concepts that go well beyond basic approximation theory.
I can construct a Gaussian process model (essentially fancy linear regression) that will fit _all_ of my medical image data _exactly_, but it will perform like absolute rubbish for determining tumor presence compared to if I trained a convolutional neural network on the same data and problem _and_ perfectly fit the data.
I could even train a fully connected network on the same data and problem, get any degree of fit you like, and it would still be rubbish.
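The kernel-interpolation point can be made concrete with a toy sketch. Everything here is a made-up stand-in: synthetic 2-D data in place of medical images, an arbitrary RBF kernel width, and a simple labeling rule. An RBF kernel interpolant ("fancy linear regression") hits every training label exactly, which by itself says nothing about how it behaves on held-out points:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, gamma=50.0):
    # Squared-distance RBF kernel between two point sets.
    d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# Synthetic stand-in data: labels are the sign of the first coordinate.
X_train = rng.uniform(-1, 1, size=(40, 2))
y_train = np.sign(X_train[:, 0])
X_test = rng.uniform(-1, 1, size=(200, 2))
y_test = np.sign(X_test[:, 0])

# Solve K @ alpha = y (tiny jitter for numerical stability): this
# interpolates the training labels exactly.
K = rbf(X_train, X_train)
alpha = np.linalg.solve(K + 1e-10 * np.eye(len(K)), y_train)

train_pred = np.sign(rbf(X_train, X_train) @ alpha)
test_pred = np.sign(rbf(X_test, X_train) @ alpha)
print((train_pred == y_train).mean())  # 1.0: perfect training fit
print((test_pred == y_test).mean())    # held-out accuracy is a separate question
```

A perfect training-set fit is cheap for a flexible kernel model; what matters for generalization is whether the model's inductive bias matches the problem, which is the whole point of the parent comment.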
That isn't what they are saying at all, lol.
There's also the massive amount of human work done on them that wasn't done before.
Data labeling is a pretty big industry in some countries, and I guess dropping 200 kilodollars on labeling is beyond the reach of most academics, even if they didn't care about the ethics of it.
Normally more parameters lead to overfitting (like fitting a high-degree polynomial to points), but neural nets are for some reason not as susceptible to that and can scale well with more parameters.
That's been my understanding of the crux of the mystery.
Would love to be corrected by someone more knowledgeable though.
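The classical overfitting picture described above can be seen in a few lines (the target function, noise level, and degrees here are arbitrary choices for illustration): a polynomial with as many coefficients as data points interpolates the noisy samples exactly, while its held-out error is typically much worse than a lower-degree fit's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from a smooth target; a degree-9 polynomial through
# 10 points interpolates them (noise included) exactly.
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(3 * x_test)  # noise-free held-out targets

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)
# Degree 9 drives training error to ~0 by fitting the noise; its
# held-out error is typically far larger than its training error.
```

The puzzle the comment refers to is that heavily overparameterized neural networks often don't behave like the degree-9 fit here; test error can keep improving as parameters grow (the "double descent" observation).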
This absolutely was the crux of the (first) mystery, and I would argue that "deep learning theory" really only took off once it recognized this. There are other mysteries too, like the feasibility of transfer learning, neural scaling laws, and more recently, in-context learning.