I think a lot of it is the massive amount of compute we've gained in the last decade. While inference may have been possible on the hardware of the time, the training would have taken lifetimes.

I have a textbook somewhere in the house from about 2000 that says that there is no point having more than three layers in a neural network.

Compute was just too expensive to build neural networks big enough for that claim to stop being true.

Once you have three layers (i.e. one "hidden" layer), you can approximate arbitrary continuous functions, so a three-layer network has the same "power" as an arbitrarily deep one.
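
To make that concrete, here's a minimal sketch in NumPy (my own illustration, not anything from the book, using ReLU units just for simplicity): with hand-constructed weights, a single hidden layer reproduces a piecewise-linear approximation of sin(x), and adding units tightens the fit. It only demonstrates representational power, nothing about how hard such a network would be to train.

```python
# One hidden layer, weights constructed by hand (no training): the network
# matches a piecewise-linear interpolation of the target function on [lo, hi].
import numpy as np

def one_hidden_layer_approx(f, lo, hi, n_units):
    knots = np.linspace(lo, hi, n_units + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)   # slope of each linear segment
    coefs = np.diff(slopes, prepend=0.0)      # change of slope at each knot
    bias = vals[0]

    def net(x):
        # hidden layer: ReLU(x - knot_i); output layer: weighted sum plus bias
        hidden = np.maximum(0.0, x[:, None] - knots[None, :-1])
        return bias + hidden @ coefs

    return net

net = one_hidden_layer_approx(np.sin, 0.0, 2 * np.pi, n_units=20)
xs = np.linspace(0.0, 2 * np.pi, 500)
print("max error with 20 hidden units:", np.max(np.abs(net(xs) - np.sin(xs))))
```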

I'm sure that's what the textbook meant, rather than any point about the expense of computing power.

People believed that more parameters would lead to overfitting instead of generalization. The various regularization methods we use today to avoid overfitting hadn't been discovered yet. Your statement is most likely about this.
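
For what it's worth, here's a rough sketch (PyTorch, my own example rather than anything referenced in this thread) of two of those regularization methods, dropout and weight decay, applied to a deliberately over-parameterized classifier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Far more parameters than a small dataset "needs" -- dropout and weight decay
# are what keep a model like this from simply memorizing the training set.
model = nn.Sequential(
    nn.Linear(784, 2048),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero half the activations during training
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(2048, 10),
)

# weight_decay folds an L2-style penalty on the weights into the update rule
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

model.train()                      # dropout is active only in training mode
x = torch.randn(32, 784)           # dummy batch
y = torch.randint(0, 10, (32,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```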

I think the problems with big networks were vanishing gradients, which is why we now use the ReLU activation function, and training stability, which was largely addressed with residual connections.
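
Roughly what I mean, as a sketch (PyTorch, purely illustrative): ReLU's gradient is 1 wherever the unit is active, so it doesn't shrink layer after layer the way sigmoid/tanh gradients do, and the skip connection gives the gradient an identity path straight through each block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = x + f(x): even if f's gradient is tiny, the identity term
        # passes gradients through the block unchanged
        return x + self.fc2(self.act(self.fc1(x)))

# Stacks of such blocks remain trainable at depths where a plain stack of
# Linear+sigmoid layers would stall.
deep_net = nn.Sequential(*[ResidualBlock(256) for _ in range(50)])
out = deep_net(torch.randn(8, 256))
print(out.shape)  # torch.Size([8, 256])
```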

Overfitting is the problem of having too little training data for your network size.
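
A toy illustration of that (NumPy, my own made-up numbers): give a ten-parameter model only ten noisy training points and it interpolates the noise, so training error is essentially zero while error on fresh data is typically far worse.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.shape)

# degree-9 polynomial: 10 coefficients for 10 points, so it can fit the noise exactly
coeffs = np.polyfit(x_train, y_train, deg=9)

x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.4f}   test MSE: {test_mse:.4f}")
```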

Possibly; I would have to dig up the book to check. IIRC it did not mention overfitting, but it was a long time ago.