Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.
Component                    Type      MTBF (yrs)   AFR
─────────────────────────────────────────────────────────
SSD                          Hardware  ~100         ~1%
RAM uncorrectable error      Hardware  ~75          ~1-4%
NVIDIA A100 critical error†  Hardware  0.18 (65d)   -
NVIDIA H100 critical error†  Hardware  0.15 (50d)   -
† “Critical error” refers to an NVIDIA Xid or sXid error that is not recoverable, requiring an application and GPU reset.

Only a minority of GPU “failures” appear to be permanent hardware problems, such as row-remapping errors. Many seem to be, as another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
I ran 16x A100s in GCP for training workloads, and it was hard to keep them running for more than a few days, so that matches these numbers.
However, I think a lot of it is driver or other software issues. I remember switching from the stock PyTorch Docker image to NVIDIA's NGC images, and reliability improved very noticeably. Do you have data broken down by popular Docker images?
> operating too close to the operational limit, tipping over it, and then requiring a power cycle.
GPUs--they're just like us!
I'm quite surprised the A100 isn't much better, since the power levels for the Ampere cards are, I believe, a lot lower.
Does this mean that even a model that fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
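For what it's worth, the usual answer is yes: long runs assume failures and resume from periodic checkpoints. A minimal sketch of the pattern (framework-agnostic, with a hypothetical file name; in PyTorch, `torch.save`/`torch.load` fill the same role as `pickle` here):

```python
import os
import pickle

CKPT = "train_state.pkl"  # hypothetical checkpoint path

def save_checkpoint(step, state, path=CKPT):
    # Write atomically: a crash mid-write must not corrupt the last good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    if os.path.exists(path):
        with open(path, "rb") as f:
            ckpt = pickle.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}  # fresh start

start, state = load_checkpoint()
for step in range(start, 100):
    state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
    if step % 10 == 0:                # checkpoint every N steps
        save_checkpoint(step + 1, state)
# After a crash, rerunning the script resumes from the last saved step.
```

The atomic rename matters in practice: if the process dies while writing, you lose at most the in-progress checkpoint, never the previous good one.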
GPU servers have always had crap reliability compared to a normal server (and sticking eight GPUs on a baseboard complicates things further). As I understand it (not my domain), this (namely the lack of widespread checkpointing and MPI fault-tolerance support) is one of the motivating factors for why ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought).
If you rebooted every server after 35 days, would that get rid of many of the problems?
It's an average time to failure, not a guarantee. Failures occur randomly.
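To make that concrete: under the usual constant-failure-rate (exponential) model, an MTBF of 50 days gives roughly a coin-flip chance of failing in any 35-day window, and the model is memoryless, so a scheduled reboot doesn't reset the clock in any useful way. The same formula also recovers the ~1% AFR for the SSD row in the table above:

```python
import math

def p_fail_within(t_days, mtbf_days):
    # Exponential model: P(failure before t) = 1 - exp(-t / MTBF)
    return 1.0 - math.exp(-t_days / mtbf_days)

# H100 figure from the table: MTBF ~50 days
print(f"P(failure within 35 days): {p_fail_within(35, 50):.0%}")        # ~50%

# SSD figure from the table: MTBF ~100 years -> AFR over one year
print(f"SSD AFR at 100-year MTBF:  {p_fail_within(365, 100 * 365):.1%}")  # ~1.0%
```

Memorylessness is the key point: having survived 35 days of uptime, the probability of failing in the *next* 35 days is the same ~50%.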
I'm curious whether running them at a slightly lower voltage would fix it, or whether it's a software thing.