Hacker News

The median database workload is probably writing just a few bytes per transaction, e.g. 'set last_login_time = now() where userid=12345'.

Because the interface between the SSD and the host OS is block-based, you are forced to write a full 4k page. That means you still benefit from a write-ahead log to batch all those changes together, at least up to page size, if not larger.
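
A rough back-of-the-envelope sketch of that arithmetic, in Python. The 16-byte record size and the assumption that every update lands on a different page are made up for illustration. The model counts only page-sized device writes and ignores any later in-place page writes, filesystem overhead, and the SSD's internal behaviour.

```python
import math

PAGE_SIZE = 4096  # bytes; the "4k page" mentioned above


def pages_unbatched(num_updates: int) -> int:
    """One full page write per update (assumes each update hits a different page)."""
    return num_updates


def pages_batched(num_updates: int, record_size: int) -> int:
    """Updates packed back to back into a log and written out as whole pages."""
    return math.ceil(num_updates * record_size / PAGE_SIZE)


def write_amplification(pages: int, num_updates: int, record_size: int) -> float:
    """Bytes sent to the device divided by bytes the application actually changed."""
    return pages * PAGE_SIZE / (num_updates * record_size)


updates, record = 1000, 16  # hypothetical workload
print(write_amplification(pages_unbatched(updates), updates, record))  # 256.0
print(write_amplification(pages_batched(updates, record), updates, record))  # 1.024
```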

Sesse__ 2 days ago

A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes. You write your intended changes to the log, fsync it (which means you get a 4k write), then make the actual changes on disk just as if you didn't have a WAL.
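
The three steps described above could be sketched roughly like this in Python. It is a toy, not how any real database lays out its log: the fixed `(offset, value)` record format and the function names are assumptions for illustration, and `os.pwrite` is only available on Unix-like systems.

```python
import os
import struct

RECORD = struct.Struct("<QQ")  # (byte offset, 8-byte value); hypothetical log record layout
VALUE = struct.Struct("<Q")


def wal_update(log_fd: int, data_fd: int, offset: int, value: int) -> None:
    # 1. Append the intended change to the log.
    os.write(log_fd, RECORD.pack(offset, value))
    # 2. fsync the log; once this returns, the change survives a crash.
    os.fsync(log_fd)
    # 3. Apply the change in place, exactly as if there were no WAL.
    #    This write may sit in the OS cache and reach the disk later.
    os.pwrite(data_fd, VALUE.pack(value), offset)


def replay(log_path: str, data_fd: int) -> int:
    """After a crash, redo every logged change; returns the number of records applied."""
    applied = 0
    with open(log_path, "rb") as log:
        while len(chunk := log.read(RECORD.size)) == RECORD.size:
            offset, value = RECORD.unpack(chunk)
            os.pwrite(data_fd, VALUE.pack(value), offset)
            applied += 1
    return applied
```

Note that the data file is still written in step 3: the log write is in addition to it, not a replacement for it.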

If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later), and then solve your durability in some other way.
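
For readers who have not met the LSM idea, here is a minimal in-memory sketch in Python. The class name, the `memtable_limit` parameter, and the use of plain lists as "sorted runs" are all assumptions for illustration; a real LSM writes each run to disk sequentially and, as noted above, needs a separate answer for durability of the memtable.

```python
class TinyLSM:
    """Toy LSM tree: a memtable flushed as sorted runs, merged later by compaction."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # recent writes; in memory only, so not durable by itself
        self.runs = []      # sorted runs, oldest first; each stands in for a sequential file

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Many small changes leave together as one sorted, sequential run.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; iterating oldest first lets newer values overwrite.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```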

throw0101a 2 days ago

> A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes.

Why not both?

Sesse__ 2 days ago

Because it is in addition to your writes, not instead of them. That's what “ahead” points to.

_bohm 2 days ago

The actual writes don’t need to be persisted on transaction commit, only the WAL. In most DBs the actual writes won’t be persisted until the written page is evicted from the page cache. In this sense, writing the WAL generally does provide better perf than synchronously doing a random page write.
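
A toy model of that behaviour in Python: commit only appends to the log, dirty pages stay in memory, and a later flush writes each dirty page once regardless of how many commits touched it. The class, its counters, and the `checkpoint` name are assumptions for illustration, and plain dicts stand in for the page cache and the disk.

```python
class TinyEngine:
    def __init__(self):
        self.wal = []         # log records; appended sequentially at commit
        self.cache = {}       # page_no -> pending changes (dirty pages, memory only)
        self.disk = {}        # page_no -> page contents that have been written back
        self.wal_syncs = 0    # stands in for fsync calls on the log
        self.page_writes = 0  # stands in for random page writes to the data file

    def commit(self, page_no, key, value):
        # Only the log is made durable at commit time.
        self.wal.append((page_no, key, value))
        self.wal_syncs += 1
        # The page itself is just modified in memory.
        self.cache.setdefault(page_no, {})[key] = value

    def checkpoint(self):
        # Later (eviction or checkpoint), each dirty page is written once.
        for page_no, changes in self.cache.items():
            self.disk.setdefault(page_no, {}).update(changes)
            self.page_writes += 1
        self.cache.clear()


engine = TinyEngine()
for i in range(100):
    engine.commit(page_no=i % 3, key=f"user{i}", value=i)
engine.checkpoint()
print(engine.wal_syncs, engine.page_writes)  # 100 3
```

The page writes still happen, but they are deferred and coalesced, which is the distinction the two sides of this thread are arguing over.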

Tostino 2 days ago

Look up how "checkpointing" works in Postgres.

Sesse__ 2 days ago

I know how checkpointing works in Postgres (which isn't very different from how it works in most other redo-log implementations). It still does not change that you need to actually update the heap at some point.

Postgres allows a group commit to try to combine multiple transactions to avoid the multiple fsyncs, but it adds delay and is off by default. And even so, it reduces fsyncs, not writes.
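
To illustrate the "fsyncs, not writes" point, here is a toy counter in Python. It is not how Postgres implements group commit; the class name and `group_size` parameter are assumptions for illustration, and the model counts log bytes only, ignoring padding to block size.

```python
class GroupCommitLog:
    def __init__(self, group_size):
        self.group_size = group_size
        self.pending = []       # commits waiting for the shared flush (this is the added delay)
        self.fsyncs = 0
        self.bytes_written = 0

    def commit(self, record: bytes):
        self.pending.append(record)
        if len(self.pending) >= self.group_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Every record is still written; only the number of fsyncs shrinks.
        self.bytes_written += sum(len(r) for r in self.pending)
        self.fsyncs += 1
        self.pending.clear()


for size in (1, 8):
    log = GroupCommitLog(group_size=size)
    for _ in range(64):
        log.commit(b"x" * 32)
    log.flush()
    print(size, log.fsyncs, log.bytes_written)
```

With 64 commits of 32 bytes each, both runs write 2048 log bytes, but the grouped run issues 8 fsyncs instead of 64.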

Tostino 2 days ago

But it turns those multiplied writes into two more sequential streams of writes. Yeah, it duplicates things, but the purpose is to allow as much sequential IO as possible (along with the other benefits and tradeoffs).

toolslive 2 days ago

You can unify the database with the write-ahead log by using a persistent data structure. It also gives you cheap/free snapshots/checkpoints.

formerly_proven 2 days ago

WALs are typically DB-page-level physical logs, and database page sizes are often larger than the I/O page size or the host page size.

esperent 2 days ago

Don't some SSDs have a 512-byte page size?

digikata 2 days ago

I would guess by now none have that internally. As a rule of thumb every major flash density increase (SLC, TLC, QLC) also tended to double internal page size. There were also internal transfer performance reasons for large sizes. Low level 16k-64k flash "pages" are common, and sometimes with even larger stripes of pages due to the internal firmware sw/hw design.

Sesse__ 2 days ago

Also due to error correction issues. Flash is notoriously unreliable, so you get bit errors _all the time_ (correcting errors is absolutely routine). And you can make more efficient error-correcting codes if you are using larger blocks. This is why HDDs went from 512 to 4096 byte blocks as well.
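
As a rough illustration of why larger blocks help, the Python sketch below computes the parity bits a single-error-correcting Hamming code needs for a given block size. This is a stand-in chosen for simplicity: real drives use much stronger codes (BCH, LDPC) that correct many bit errors, so the numbers only show the trend, not actual overheads.

```python
def hamming_parity_bits(data_bits: int) -> int:
    """Smallest r such that 2**r >= data_bits + r + 1 (single-error-correcting Hamming code)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


for block_bytes in (512, 4096):
    data_bits = block_bytes * 8
    r = hamming_parity_bits(data_bits)
    print(block_bytes, r, f"{100 * r / data_bits:.3f}%")
```

Under this simplified model a 512-byte block needs 13 parity bits and a 4096-byte block needs 16, so the relative overhead drops by roughly a factor of six.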

zokier 2 days ago

They might present 512-byte blocks to the host, but internally the SSD almost certainly manages data in larger pages.

cm2187 2 days ago

And the filesystem will also likely be 4k block size.