Just don't use genetically identical hardware:
https://news.ycombinator.com/item?id=32031639
https://news.ycombinator.com/item?id=32032235
Edit: wow, I can't believe we hadn't put https://news.ycombinator.com/item?id=32031243 in https://news.ycombinator.com/highlights. Fixed now.
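A minimal sketch of one way to check for this, assuming smartmontools is installed: group an array's drives by serial-number prefix to spot ones that likely shipped in the same manufacturing batch. The device list and prefix length here are illustrative guesses, not recommendations.

    import re
    import subprocess
    from collections import defaultdict

    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # hypothetical paths
    PREFIX_LEN = 6  # serials from one batch usually share a long common prefix

    batches = defaultdict(list)
    for dev in DEVICES:
        # smartctl -i prints an identity block that includes the serial number
        out = subprocess.run(["smartctl", "-i", dev],
                             capture_output=True, text=True).stdout
        m = re.search(r"Serial Number:\s+(\S+)", out)
        if m:
            batches[m.group(1)[:PREFIX_LEN]].append(dev)

    for prefix, devs in batches.items():
        if len(devs) > 1:
            print(f"warning: {devs} share serial prefix {prefix!r}; "
                  "they may fail together")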
I’ve seen this up close twice, and I’m surprised it’s only twice. Between March and September one year, 6 people on one team had to get new hard drives in their ThinkPads and rebuild their systems. All from the same PO, but doled out over the course of a project ramp-up. That was the first project where the onboarding docs were really, really good, since we got a lot of practice in a short period of time.
Long before that, the first RAID array anyone set up for my team’s use arrived from Sun with 2 dead drives out of 10. They RMA’d us 2 more drives, and one of those was also DOA. That was a couple of years after Sun stopped burning in hardware to cut costs, which maybe wasn’t that much of a savings, all things considered.
I got burnt by this bug on freakin' Christmas Eve 2020 ( https://forum.hddguru.com/viewtopic.php?f=10&t=40766 ). There was some data loss and a lot of lessons learned.
Many years ago (13?), I was around when Amazon moved SABLE from RAM to SSDs. A whole rack came from a single batch, and something like 128 disks went out at once.
I was an intern, but everyone seemed very stressed.
I love that "Ask HN: What'd you do while HN was down?" was a thing.
My plan B was browsing the Stack Exchange homepage for interesting threads, but it got repetitive.
Man, I hit something like that once: an SSD had a firmware bug where it would stop working at an exact number of power-on hours.
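For illustration, the usual failure pattern behind those "dies at exactly N hours" reports is a power-on-hours counter stored as a signed 16-bit integer, which wraps negative after 32,767 hours. A toy model of that overflow (my assumption about the mechanism, not any vendor's actual firmware):

    def as_int16(value: int) -> int:
        # Interpret an integer as a two's-complement signed 16-bit value.
        value &= 0xFFFF
        return value - 0x10000 if value >= 0x8000 else value

    hours = 32767
    print(as_int16(hours))      # 32767: last representable hour, still fine

    hours += 1                  # one more hour of uptime...
    print(as_int16(hours))      # -32768: the counter goes negative, and a
                                # firmware sanity check bricks the drive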