No mention of disk failure rates? Curious how it's holding up after a few months.

I've mentioned this story before, but we had massive drive failures when bringing up multiple disk arrays. We got them racked on a Friday afternoon, and then I wrote a quick and dirty shell script to read/write data back and forth between them over the weekend, set to kick in after they finished striping the RAID arrays. By quick and dirty I mean there was no logging, just a bunch of commands saved as a .sh file. Came in on Monday to find massive failures in all of the arrays, but no insight into whether they failed during the stripe or while being stressed. It was close to a 50% failure rate. Turned out to be a bad batch from the factory; multiple customers of our vendor were complaining. All the drives were replaced by the manufacturer, so it just delayed the storage being available to production. After that, not one of them failed in the next 12 months before I left for another job.

> next 12 months before I left for another job

Heh, that's a clever solution to the problem of managing storage through the full 10-year disk lifecycle.

Disk failure rates are very low compared to a decade ago. I used to change more than a dozen disks every week back then. Now it's an eyebrow-raising event I seldom see.

I think following Backblaze's hard disk stats is enough at this point.

Backblaze reports an annual failure rate of 1.36% [0]. Since the cluster in question uses 2,400 drives, they would likely see ~32 failures a year (an extra ~$4,000 in annual capex, almost negligible).

[0] https://www.backblaze.com/cloud-storage/resources/hard-drive...
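
Quick sanity check on that math (the ~$125 per replacement drive is my assumption, roughly what it takes to land at the ~$4,000 figure above):

    # Back-of-the-envelope, assuming Backblaze's fleet-wide AFR applies to
    # this cluster's used 12TB drives (used drives may run a bit higher).
    drives = 2400
    afr = 0.0136            # Backblaze-reported annual failure rate
    price_per_drive = 125   # assumed replacement cost for a used 12TB drive (USD)

    expected_failures = drives * afr                    # ~32.6 drives/year
    annual_cost = expected_failures * price_per_drive   # ~$4,080/year
    print(f"~{expected_failures:.0f} failures/yr, roughly ${annual_cost:,.0f}/yr")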

Their rate will probably be higher since they are utilizing used drives. From the spec:

> 2,400 drives. Mostly 12TB used enterprise drives (3/4 SATA, 1/4 SAS). The JBOD DS4246s work for either.

Not necessarily, since disk failure rates over time are typically U-shaped.

Buying used drives eliminates the high rate of early failure (but does get you a bit closer to the 2nd part of the U-curve).

Typically, most drives become obsolete before hitting the high failure rates on the right side of the U-curve from longevity (7+ years).

I bet you still have a higher early failure rate because of the stress from transportation, even if there's no funny business. And I expect some funny business, because used enterprise drives often come with wiped SMART data, and some drives may have been retired by sophisticated clients who decided they were near failure.

Physically moving drives tends to reset the U-curve; some will be damaged in transit.

They mentioned the cluster is built from used enterprise drives. I can see the desire to save money, but I agree that it's going to be one expensive mistake down the road.

I should also note that personally, for home cluster use, I quickly learned that used drives didn't seem to make sense. Too much performance variability.

If I remember correctly, most drives either:

1. Fail in the first X amount of time

2. Fail towards the end of their rated lifespan

So buying used drives doesn't seem like the worst idea to me. You've already filtered out the drives that would fail early.

Disclaimer: I have no idea what I'm talking about

Over in hardware-land we call this "the bathtub curve".
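
A toy sketch of that shape, with made-up numbers just to show the idea (infant mortality decaying, a flat baseline, and wear-out ramping up later):

    import math

    # Toy bathtub-curve sketch: hazard = infant mortality + flat baseline + wear-out.
    # All numbers are made up for illustration, not real drive statistics.
    def annual_failure_rate(age_years: float) -> float:
        infant  = 0.08 * math.exp(-age_years / 0.5)       # early failures decay quickly
        baseline = 0.01                                   # flat mid-life rate
        wearout = 0.005 * max(age_years - 5.0, 0.0) ** 2  # ramps up after ~5 years
        return infant + baseline + wearout

    for age in (0.1, 1, 3, 5, 7, 9):
        print(f"year {age:>4}: ~{annual_failure_rate(age):.1%}")

Buying used is essentially a bet that the drives you get sit in the flat middle of that curve.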

We don't have perfect metrics here, but this seems to match our experience; a lot of failures happened shortly after install, before the bulk of the data was downloaded onto the heap, so actual data loss is lower than the hardware failure rate.

Where did you source them? I've thought about buying HDDs from a vendor like serverpartdeals.com but was unsure how reliable the drives would be.

Used drives make sense if maintaining your home server is a hobby. It's fun to diagnose and solve problems in home servers, and failing drives give me a reason to work on the server. (I'm only half-joking, it's kind of fun.)

In a datacenter context, failure rates are just a recurring remote-hands cost, so it's not too bad with front-loaders.

E.g., have someone show up at the datacenter with a grocery list of slot indices and a cart of fresh drives every few months.

good point