So on the off-chance that there's a firmware engineer in here, how does this actually work?

Like, does an SSD do some sort of refresh on power-on, or every N hours, or do you have to access the specific block, or...? And what if you interrupt the process, e.g., an NVMe drive in an external case that you only plug in once a month for a few minutes to use as a huge flash drive -- is that a problem?

What about the unused space: is a 4 TB drive that's only used to transport 1 GB of stuff going to suffer anything from all that unused space decaying?

It's all very unclear what all of this means in practice and how a user is supposed to manage it.

SSD firmware engineer here. I work on enterprise stuff, so ymmv on consumer grade internals.

Generally, the data refresh will all happen in the background when the system is powered (depending on the power state). Performance is probably throttled during those operations, so you just see a slightly slower copy while this is happening behind the scenes.

The unused space decaying is probably not an issue, since the internal filesystem data is typically stored on a more robust area of media (an SLC location) which is less susceptible to data loss over time.

As far as how a user is supposed to manage it, maybe do an fsck every month or something? Using an SSD like that is probably ok most of the time, but might not be super great as a cold storage backup.

So say I have a 4TB USB SSD from a few years ago, that's been sitting unpowered in a drawer most of that time. How long would it need to be powered on (ballpark) for the full disk refresh to complete? Assume fully idle.

(As a note: I do have a 4TB USB SSD which sat in a drawer untouched for a couple of years. The data was all fine when I plugged it back in. Of course, this was a new drive with very low write cycles, stored in a climate-controlled room. An older, worn-out drive would probably have been an issue.) Just wondering how long I should keep it plugged in if I ever have a situation like that, so I can "reset the fade clock", so to speak.

It's more certain to just do a full read of the drive, to force error correction and a rewrite of any weakening data.
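
A minimal sketch of what that could look like on Linux, assuming the drive shows up as /dev/sdX (substitute your actual device; smartctl is from smartmontools, and what it reports varies by vendor):

  # Read-only pass over the whole device; the controller has to ECC-decode
  # every block, which is what surfaces (and, on many drives, triggers a
  # rewrite of) weakening data.
  sudo badblocks -b 4096 -sv /dev/sdX

  # Optionally check what the drive reports about itself afterwards.
  sudo smartctl -a /dev/sdX

Whether the controller rewrites weak blocks immediately on read is vendor-specific, as discussed elsewhere in the thread.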

>Generally, the data refresh will all happen in the background when the system is powered (depending on the power state).

How does the SSD know when to run the refresh job? AFAIK SSDs don't have an internal clock, so they can't tell how long they've been powered off. Moreover, does doing a read generate some sort of telemetry for the controller indicating how strong/weak the signal is, thereby informing whether it should refresh? Or does it blindly refresh on some sort of timer?

Pretty much, but it depends a lot on the vendor and how much you spent on the drive. One of the big assumptions about enterprise SSDs is that they're powered pretty much all the time, but left in a low-power state when not in use. So data can still be refreshed on a timer, as long as it happens within the power budget.

There are several layers of data integrity that are increasingly expensive to run. Once the drive tries to read something that requires recovery, it marks that block as requiring a refresh and rewrites it in the background.

https://www.techspot.com/news/60501-samsung-addresses-slow-8...

Samsung's fix was aggressive scanning and rewriting in the background.

> maybe do an fsck every month or something

Isn't that what periodic "scrub" operations are on modern fs like ZFS/BTRFS/BCacheFS?
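
(For reference, a typical scrub invocation looks like the following, assuming a ZFS pool named tank and a Btrfs filesystem mounted at /mnt/data, both placeholder names:)

  # ZFS: read and verify every allocated block, repairing from redundancy where possible
  sudo zpool scrub tank
  sudo zpool status tank          # shows scrub progress and any errors found

  # Btrfs: same idea
  sudo btrfs scrub start /mnt/data
  sudo btrfs scrub status /mnt/data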

> the data refresh will all happen in the background when the system is powered

This confused me. If it happens in the background, what's the manual fsck supposed to be for?

So you need to do an fsck? My big question after reading this article (and others like it) is whether it is enough to just power up the device (for how long?), or if each byte actually needs to be read.

The case an average user is worried about is where they have an external SSD that they back stuff up to on a relatively infrequent schedule. In that situation, the question is whether just plugging it in and copying some stuff to it is enough to ensure that all the data on the drive is refreshed, or whether there's some explicit kind of "maintenance" that needs to be done.

How long does the data refresh take, approx? Let's say I have an external portable SSD that I keep stored data on. Would plugging the drive into my computer and running

  dd if=/dev/sdX of=/dev/null bs=1M status=progress
work to refresh any bad blocks internally?

A full read would do it, but I think the safer recommendation is to just use a small HDD for external storage. Anything else is just layering on mitigations.

Thanks! I think you're right about just using an HDD, but for my portable SSD situation, after a full read of all blocks, how long would you leave the drive plugged in for? Does the refresh procedure typically take a while, or would it be completed in roughly the time it would take to read all blocks?

Keep in mind that when flash memory is read, you don't get back 0 or 1. You get back (roughly) a floating point value -- so you might get back 0.1, or 0.8. There's extensive code in SSD controllers to reassemble/error-correct/compensate for that, along with LDPC-ish encoding schemes.

Modern controllers have a good idea how healthy the flash is. They will move data around to compensate for weakness. They're doing far more to detect and correct errors than a file system ever will, at least at the single-device level.
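
If you're curious what the controller is willing to tell you about media health, some of it is exposed through SMART and the NVMe health log (a sketch, assuming smartmontools and nvme-cli are installed; the exact fields and their meaning vary a lot by vendor):

  # SATA/USB SSDs: vendor-specific attributes, often including wear and error counts
  sudo smartctl -x /dev/sdX

  # NVMe drives: the standard health log (media errors, percentage used, etc.)
  sudo nvme smart-log /dev/nvme0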

It's hard to get away from the basic question, though -- when is the data going to go "poof!" and disappear?

That is when your restore system will be tested.

Typically unused empty space is a good thing, as it will allow drives to run in MLC or SLC mode instead of their native QLC. (At least, this seems to be the obvious implication from performance testing, given the better performance of SLC/MLC compared to QLC.) And the data retention of SLC/MLC can be expected to be significantly better than that of QLC.

>as it will allow drives to run in MLC or SLC mode instead of their native QLC

That depends on the SSD controller implementation, specifically on whether it proactively moves stuff from the SLC cache to the TLC/QLC area. I expect most controllers do this, given that if they don't, the drive will quickly lose performance as it fills up. There's basically no reason not to proactively move stuff over.