Interesting article. Is it concluding that different small networks form inside the larger network for the different types of problems we are trying to solve with it?

How is this different from overfitting, though? (PS: Overfitting isn't that bad if you think about it, as long as the problems the model sees in the test dataset or at inference time are already covered by the supposedly large enough training dataset.)