How do they determine the cause of failure in things like this?

Lots and lots of telemetry.

A lot of sensors.