> Skipping flushing the local disk seems rather silly to me
It is. Coordinated failures shouldn't be a surprise these days. It's kind of sad to here that from an AWS engineer. Same data pattern fills the buffers and crashes multiple servers, while they were all "hoping" that others fsynced the data, but it turns out they all filled up and crashed. That's just one case there are others.
Durability always has an asterisk i.e. guaranteed up to N number of devices failing. Once that N is set, your durability is out the moment those N devices all fail together. Whether that N counts local disks or remote servers.
Interestingly, on bare metal or old-school VMs, durability of local storage was pretty good. If the rack power failed, your data was probably still there. Sure, maybe it was only 99% or 99.9%, but that’s not bad if power failures are rare.
AWS etc, in contrast, have really quite abysmal durability for local disk storage. If you want the performance and cost benefits of using local storage (as opposed to S3, EBS, etc), there are plenty of server failure scenarios where the probability that your data is still there hovers around 0%.
This is about not even trying durability before returning a result ("Commit-to-disk on a single system is [...] unnecessary") it's hoping that servers won't crash and restart together: some might fail but others will eventually commit. However that assumes a subset of random (uncoordinated) hardware failures, maybe a cosmic ray blasts the ssd controller. That's fine, but it fails to account for coordinated failure where, a particular workload leads to the same overflow scenario on all servers the same. They all acknowledge the writes to the client but then all crash and restart.
To some extent the only way around that is to use non-uniform hardware though.
Suppose you have each server commit the data "to disk" but it's really a RAID controller with a battery-backed write cache or enterprise SSD with a DRAM cache and an internal capacitor to flush the cache on power failure. If they're all the same model and you find a usage pattern that will crash the firmware before it does the write, you lose the data. It's little different than having the storage node do it. If the code has a bug and they all run the same code then they all run the same bug.
Yeah good point, at least if you wait till you get an acknowledgement for the fsync on N nodes it's already in an a lot better position. Maybe overkill but you can also read the back the data and reverify the checksum. But yeah in general you make a good point, I think that's why some folks deliberately use different drive models and/or raid controllers to avoid cases like that.