I ran 16x A100 on GCP for training workloads, and it was hard to keep them running for more than a few days, so that matches what I saw.
However, I think a lot of it is a driver or other software issue. I remember switching from the PyTorch Docker image to NVIDIA's NGC images, and reliability improved very noticeably. Do you have data for popular Docker images?