> Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.
Overall speed is irrelevant; what matters is the relative speed difference between sequential and random access.
And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
Can you clarify? I thought a major benefit of SSDs is that there isn't any difference between sequential and random access. There's no physical head that needs to move.
Edit: thank you for all the answers -- very educational, TIL!
Let's take the Samsung 9100 Pro M.2 as an example. It has a sequential read rate of ~6700 MB/s and a 4k random read rate of ~80 MB/s:
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That works out to about 50 µs of latency per random read, compared to 4-5 ms for HDDs.
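For anyone who wants to check the arithmetic, a quick sketch (the throughput figure comes from the benchmark above, so treat it as illustrative):

```c
#include <stdio.h>

int main(void) {
    /* Figures from the userbenchmark screenshot above; actual numbers
       depend on the drive and queue depth. */
    double random_mbps = 80.0;   /* 4k random read throughput, MB/s */
    double io_size = 4096.0;     /* bytes per random read */

    double iops = random_mbps * 1e6 / io_size;  /* ~19.5k IOPS */
    double latency_us = 1e6 / iops;             /* ~51 us per read at QD1 */

    printf("IOPS: %.0f, latency: %.1f us\n", iops, latency_us);
    return 0;
}
```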
Datacenter storage will generally not be using M.2 client drives. Client drives employ optimizations that win many benchmarks but sacrifice consistency along multiple dimensions (power loss protection, write performance that degrades as they fill, perhaps others).
With SSDs, the write pattern is very important to read performance.
Datacenter- and enterprise-class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the unit that must be erased before it can be rewritten.
Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.
If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.
If all writes are done such that they are 128k aligned, sequential reads will be optimal, and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read ahead. My own benchmarks of gen4 U.2 drives generally back up these statements.
At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
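For the curious, a minimal sketch of what direct IO looks like on Linux, assuming a 4k indirection unit and the 128k transfer size mentioned above (a real benchmark would use libaio or io_uring to keep queue depth up):

```c
/* Minimal O_DIRECT read sketch (Linux). O_DIRECT requires the buffer,
   the offset, and the length to all be aligned -- 4096 here, matching
   the presumed 4k IU. Error handling is abbreviated. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096
#define READ_SIZE (128 * 1024)   /* 128k, matching the max transfer size */

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);  /* bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, ALIGN, READ_SIZE) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* The file offset must also be a multiple of the alignment. */
    ssize_t n = pread(fd, buf, READ_SIZE, 0);
    if (n < 0) perror("pread");
    else printf("read %zd bytes with no kernel caching\n", n);

    free(buf);
    close(fd);
    return 0;
}
```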
That means below a particular table size, it’s literally faster to do a full table scan.
Are the 4K random reads impacted by the fact that you still cannot switch Samsung SSDs to 4K native sectors?
I think that has a bigger impact on writes than on reads, but it certainly means there is some gap from optimal.
To me a 4K read seems anachronistic from a modern application perspective. I gather 4K pages are still common in many file systems, but that doesn’t mean the majority of reads are 4K random in a real-world scenario.
SSDs have three block/page sizes:
- The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.
- The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.
- The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.
SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect for drives to further optimize sequential operations, just like the OS does.
(N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)
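To make the journaled-mapping idea concrete, here is a toy sketch of the core trick. Every size here is invented for illustration, and real FTLs add striping, FEC, wear leveling, and actual garbage collection:

```c
/* Toy FTL sketch: a flat logical-to-physical map plus append-only
   writes, showing why overwrites never happen in place. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define IU_SIZE      4096   /* indirection unit (logical page)        */
#define PAGES_PER_EB 4096   /* erase block = 16 MiB of 4k pages       */
#define NUM_EB       64

static uint32_t l2p[NUM_EB * PAGES_PER_EB]; /* logical -> physical page */
static uint32_t write_head;                 /* next free physical page  */

static void ftl_init(void) {
    memset(l2p, 0xFF, sizeof l2p);          /* all mappings invalid */
    write_head = 0;
}

/* Every write, sequential or random, appends at the write head; the
   old physical page just becomes garbage for later compaction (GC). */
static void ftl_write(uint32_t logical_page) {
    l2p[logical_page] = write_head++;
}

static uint32_t ftl_read(uint32_t logical_page) {
    return l2p[logical_page];               /* UINT32_MAX if unwritten */
}

int main(void) {
    ftl_init();
    ftl_write(10); ftl_write(11); ftl_write(10);  /* overwrite of 10 */
    printf("logical 10 -> physical %u (old copy at 0 is now garbage)\n",
           ftl_read(10));
    return 0;
}
```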
SSD controllers and VFSs are often optimized for sequential access (e.g. readahead cache) which leads to software being written to do sequential access for speed which leads to optimization for that access pattern, and so on.
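For example, the VFS-level knob for this on Linux is posix_fadvise; how aggressively a given kernel acts on the hints is implementation-defined:

```c
/* Sketch of the access-pattern hints the VFS exposes; the kernel's
   readahead behavior is tuned by these. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel we'll scan sequentially: it may grow readahead. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ...or that access is random: it may disable readahead entirely. */
    /* posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */

    close(fd);
    return 0;
}
```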
That's not possible. It's not an SSD thing either; it applies to everything [0].
Sequential access is just the simplest example of predictable access, which is always going to perform better than random access because it's possible to optimize around it. You can't optimize around randomness.
So if you give me your fanciest, fastest random access SSD, I can always hand you back that SSD but now with sequential access faster than the random access.
[0]: RAM access, CPU branch prediction, buying stuff in bulk...
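The RAM case in [0] is easy to demonstrate yourself. A minimal sketch (timings are machine-dependent):

```c
/* Summing the same array sequentially vs. in a shuffled order. The
   sequential pass wins because the hardware prefetcher can predict it. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints, ~64 MB: larger than a typical LLC */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *data = malloc(N * sizeof *data);
    size_t *order = malloc(N * sizeof *order);
    for (size_t i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

    /* Fisher-Yates shuffle to build a random visiting order. */
    srand(42);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }

    long sum = 0;
    double t0 = now();
    for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
    double t1 = now();
    for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random */
    double t2 = now();

    printf("sum=%ld sequential=%.3fs random=%.3fs\n", sum, t1 - t0, t2 - t1);
    free(data); free(order);
    return 0;
}
```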
It depends on the size of the read - most SSDs have internal block sizes much larger than a typical (actual) random read, so they internally have to do a lot more work per byte of output in a random-read situation than they would in a sequential one.
Most filesystems read in 4K chunks (or sometimes even worse, 512 bytes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.
Note the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16 MB? 2 GB?
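A back-of-the-envelope sketch of that read multiplication; the internal block size here is invented, only the principle matters:

```c
/* If the drive internally fetches INTERNAL_BLOCK bytes to serve a read,
   a small random read wastes most of that work, while a large read
   amortizes it. Worst case assumed: every request lands in a
   different internal block. */
#include <stdio.h>

#define INTERNAL_BLOCK (16 * 1024)   /* hypothetical internal read unit */

static double amplification(long request_bytes) {
    long blocks = (request_bytes + INTERNAL_BLOCK - 1) / INTERNAL_BLOCK;
    return (double)blocks * INTERNAL_BLOCK / request_bytes;
}

int main(void) {
    printf("512B read: %.1fx internal work\n", amplification(512));
    printf("4K read:   %.1fx internal work\n", amplification(4096));
    printf("16MB read: %.2fx internal work\n",
           amplification(16L * 1024 * 1024));
    return 0;
}
```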
Some discussion in the FragPicker paper (2021) FWIW: https://dl.acm.org/doi/10.1145/3477132.3483593
> Our extensive experiments discover that, unlike HDDs, the performance degradation of modern storage devices incurred by fragmentation mainly stems from request splitting, where a single I/O request is split into multiple ones.
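A toy sketch of that request-splitting effect, with an invented extent layout; a read spanning eight extents becomes eight device requests:

```c
/* One logical read over a fragmented file becomes one device request
   per extent it touches. */
#include <stdio.h>

struct extent { long logical_off, length; };   /* simplified extent map */

/* Count how many device requests a read of [off, off+len) turns into. */
static int count_requests(const struct extent *map, int n, long off, long len) {
    int reqs = 0;
    for (int i = 0; i < n; i++) {
        long start = map[i].logical_off;
        long end = start + map[i].length;
        if (off < end && off + len > start)
            reqs++;   /* the read overlaps this extent: one more request */
    }
    return reqs;
}

int main(void) {
    /* Contiguous: one 1 MiB extent. Fragmented: eight 128 KiB pieces. */
    struct extent contiguous[] = { {0, 1 << 20} };
    struct extent fragmented[8];
    for (int i = 0; i < 8; i++)
        fragmented[i] = (struct extent){ i * (128 << 10), 128 << 10 };

    printf("1 MiB read, contiguous: %d request(s)\n",
           count_requests(contiguous, 1, 0, 1 << 20));
    printf("1 MiB read, fragmented: %d request(s)\n",
           count_requests(fragmented, 8, 0, 1 << 20));
    return 0;
}
```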
SSD block sizes are far bigger than 4 kB. They still benefit from sequential writes.
Read up on IOPS, in conjunction with how requests get merged for sequential reads.
Same with doing things in RAM. Sequential writes and cache-friendly reads, which B-trees tend to achieve for any definition of cache. Some compaction/GC/whatever step at some point. Nothing's fundamentally changed, right?
Pity Optane, which solved this quite well, was discontinued.
It really is a shame Optane was discontinued. For durable low-latency writes there really is nothing else out there.