This is about not even trying durability before returning a result ("Commit-to-disk on a single system is [...] unnecessary"): it's hoping the servers won't all crash and restart together, so that even if some fail, the others will eventually commit. That assumption holds for random, uncoordinated hardware failures, say a cosmic ray frying one SSD controller. What it fails to account for is coordinated failure, where a particular workload triggers the same overflow scenario on every server at once: they all acknowledge the write to the client, then all crash and restart, and the data is gone.
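To make that failure window concrete, here's a toy sketch (Python, names like `Replica` and the quorum size are made up for illustration, not anyone's real protocol) of "ack once a quorum has the write in memory, flush to disk later". An uncorrelated crash of one node is survivable; a correlated crash of all nodes before any flush loses a write the client was already told succeeded.

```python
# Minimal sketch of ack-before-durability replication. Hypothetical names;
# a real system would have logs, retries, and recovery, none of which change
# the point about correlated crashes.
from dataclasses import dataclass, field


@dataclass
class Replica:
    in_memory: list = field(default_factory=list)
    on_disk: list = field(default_factory=list)

    def buffer(self, record: bytes) -> None:
        self.in_memory.append(record)        # fast path: RAM only

    def flush(self) -> None:
        self.on_disk.extend(self.in_memory)  # the deferred "commit to disk"
        self.in_memory.clear()

    def crash_and_restart(self) -> None:
        self.in_memory.clear()               # anything not flushed is lost


def replicate(record: bytes, replicas: list, quorum: int) -> bool:
    """Acknowledge as soon as `quorum` replicas have buffered the record."""
    acks = 0
    for r in replicas:
        r.buffer(record)
        acks += 1
        if acks >= quorum:
            return True                      # client sees "success" here
    return False


replicas = [Replica() for _ in range(3)]
assert replicate(b"row-42", replicas, quorum=2)  # acknowledged to the client

# Uncorrelated failure: one node dies before flushing, another still has it.
replicas[0].crash_and_restart()
assert any(b"row-42" in r.in_memory for r in replicas[1:])

# Correlated failure: the same workload crashes every node before any flush.
for r in replicas:
    r.crash_and_restart()
assert all(b"row-42" not in r.in_memory + r.on_disk for r in replicas)
# The write was acknowledged, yet no copy survives anywhere.
```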

To some extent, the only way around that is to use non-uniform hardware, though.

Suppose you have each server commit the data "to disk", but it's really a RAID controller with a battery-backed write cache, or an enterprise SSD with a DRAM cache and an internal capacitor to flush it on power failure. If they're all the same model and you find a usage pattern that crashes the firmware before it does the write, you lose the data. It's little different from having the storage node do it: if the code has a bug and they all run the same code, then they all run the same bug.

Yeah, good point. At least if you wait until you get an acknowledgement for the fsync on N nodes, you're already in a much better position. Maybe it's overkill, but you can also read back the data and re-verify the checksum. In general you make a good point, though; I think that's why some folks deliberately use different drive models and/or RAID controllers to avoid cases like that.
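For what it's worth, here's roughly what I mean by "fsync, then read back and re-verify the checksum" on a single replica; the helper name is made up and this is just a sketch. One caveat: an immediate read-back is usually served from the page cache, so it mostly catches corruption in the write path rather than proving the bits reached the physical media.

```python
import hashlib
import os


def durable_write_and_verify(path: str, payload: bytes) -> str:
    """Write payload, fsync it, then read it back and re-check its SHA-256.

    Hypothetical helper for illustration; a replica would only acknowledge
    the client's write after this returns successfully.
    """
    digest = hashlib.sha256(payload).hexdigest()

    # Write and fsync so the kernel has handed the data to the device
    # before we consider acknowledging the write.
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # Paranoid read-back: reopen the file and confirm the checksum matches.
    # This likely reads from the page cache, so it guards against write-path
    # bugs, not a lying or broken drive.
    with open(path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() != digest:
            raise IOError(f"checksum mismatch after fsync on {path}")

    return digest
```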