A few critiques:

- If you have a feature detector function (f(x) = 0 when the feature is absent, f(x) = 1 when it is present) and you train a network to compute f(x), or some subnetwork "decides on its own during training" to compute f(x), doesn't that create a zero set of positive measure if training continues long enough? (See the first sketch after this list.)

- What happens when the middle layers have much lower dimension than the input? (See the second sketch below.)

- Real analyticity requires infinitely many derivatives (according to Appendix A). Does this mean the results don't apply to functions with corners (e.g. ReLU, which isn't even once differentiable at 0)? (See the third sketch below.)
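
On the first point, here's a minimal sketch of what I mean, assuming a one-hidden-layer tanh network (analytic activation) fit to the indicator of an interval; the architecture and hyperparameters are illustrative, not from the paper. After any finite amount of training the output on the "feature absent" region is small but generically nonzero, so the zero set stays measure zero; only the limit of training would have a zero set of positive measure:

```python
import torch

torch.manual_seed(0)

# Analytic network: tanh hidden layer, linear output.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

# Feature detector target: f(x) = 1 on [0, 1], f(x) = 0 elsewhere.
x = torch.linspace(-2.0, 2.0, 512).unsqueeze(1)
y = ((x >= 0) & (x <= 1)).float()

for step in range(2000):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    # Outputs on a region where the target is identically 0: small,
    # but (generically) not exactly zero at any finite training time.
    off_region = net(torch.linspace(-2.0, -1.0, 256).unsqueeze(1))
    print(f"max |output| where f = 0: {off_region.abs().max().item():.2e}")
```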
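On the second point, a hedged sketch of why a narrow middle layer seems to matter: the network's Jacobian factors through the bottleneck, so its rank is at most the bottleneck width at every input, and the image is correspondingly low-dimensional. The 10 → 2 → 10 architecture below is just an example I picked:

```python
import torch

torch.manual_seed(0)
n, k = 10, 2  # input/output dimension n, bottleneck width k < n

net = torch.nn.Sequential(
    torch.nn.Linear(n, k), torch.nn.Tanh(), torch.nn.Linear(k, n)
)

x = torch.randn(n)
# Jacobian = W2 @ diag(tanh') @ W1, so rank(J) <= k everywhere.
J = torch.autograd.functional.jacobian(net, x)  # shape (n, n)
print("numerical rank:", torch.linalg.matrix_rank(J).item())  # 2, not 10
```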
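And on the third point, a quick numerical check that ReLU fails even first-order differentiability at 0 (so real analyticity is certainly out): the one-sided difference quotients at the corner disagree no matter how small the step gets.

```python
def relu(x: float) -> float:
    return max(x, 0.0)

for h in (1e-1, 1e-4, 1e-8):
    left = (relu(0.0) - relu(-h)) / h   # -> 0
    right = (relu(h) - relu(0.0)) / h   # -> 1
    print(f"h={h:g}: left={left:.1f}, right={right:.1f}")
```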