In some sense, isn't this overfitting, but "hidden" by the typical feature sets that are observed?

Time and time again, some procedure identifies a simple but absurd adversarial "trick stimulus" that throws off the deep network's solution. These seem like blatant cases of overfitting that go unrecognized or unchallenged in typical use, because the sampling space of stimuli doesn't usually include the adversarial trick stimuli.
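To make that concrete, here's a minimal sketch of one such procedure, the fast gradient sign method (FGSM), assuming PyTorch. The toy linear model and random "image" are placeholders I've made up for illustration; with a genuinely trained network, a perturbation this small is visually negligible yet frequently flips the prediction.

```python
# Minimal FGSM sketch (assumes PyTorch). Model and data are toy stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for a trained deep network.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

x = torch.rand(1, 1, 28, 28)  # stand-in for a natural input
y = torch.tensor([3])         # stand-in for its true label

# Gradient of the loss with respect to the *input*, not the weights.
x.requires_grad_(True)
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# FGSM: step a small amount in the direction that most increases the loss.
epsilon = 0.1  # perturbation budget (assumed value)
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```

The point is that `x_adv` stays inside the valid input range and within epsilon of a perfectly ordinary input, so nothing in the usual sampling distribution would ever surface it.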

I guess I've not really thought of the bias-variance tradeoff as necessarily being about the number of parameters, but rather about the flexibility of the model relative to the learnable information in the sample space. There are formulations (e.g., Shtarkov-Rissanen normalized maximum likelihood) that treat overfitting in terms of a model's ability to reproduce data wildly outside a typical training set. This is related to, but not the same as, the number of parameters per se.
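For reference, the Shtarkov-Rissanen construction can be written as follows, where $\hat{\theta}(x)$ is the maximum-likelihood estimate fit to the data $x$ itself:

```latex
% Normalized maximum likelihood distribution and its parametric complexity.
\[
p_{\mathrm{NML}}(x)
  = \frac{p\bigl(x \mid \hat{\theta}(x)\bigr)}
         {\sum_{x'} p\bigl(x' \mid \hat{\theta}(x')\bigr)},
\qquad
\mathrm{COMP} = \log \sum_{x'} p\bigl(x' \mid \hat{\theta}(x')\bigr)
\]
% The denominator sums the best-case fit over every possible data set x',
% so a class flexible enough to "explain" wild data pays a large
% complexity penalty even if its raw parameter count is small.
```

The denominator is exactly the "ability to reproduce atypical data" in formal dress: a model class that can fit almost anything well has a large Shtarkov sum, and hence high complexity, regardless of how many parameters it nominally has.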
