Rumelhart et al wrote "Parallel Distributed Processing"; there's a chapter where he proves that the backprop algorithm maximizes "harmony", which is simply a different formulation of error minimization.

I remember reading this book enthusiastically back in the mid 90s. I don't recall struggling with the proof, it was fairly straightforward. (I was in senior high school year at the time.)