The author could have started by surveying the current state of the art instead of falsely assuming that DB devs have been resting on their laurels for the past few decades. If you want to see a (relational) DB built for SSDs, check out stuff like myrocks on zenfs+; it's pretty impressive.
There has also been significant academic study of DBMS design for persistent memory, which SSD technology can serve as (e.g. as NVDIMMs, or more abstractly): think of a system with no distinction between primary and secondary storage, RAM and disk; there's just a huge amount of not-terribly-fast memory, and whatever you write to it never goes away. It's an interesting model.
bcachefs's btree still beats the pants off of the entire rocksdb lineage :)
Aren't B-trees and LSM-trees fundamentally different tradeoffs? B-trees will always win on read-heavy workloads, and LSM-trees on write-heavy ones (with Bε (B-epsilon) trees somewhere in the middle).
For on disk data structures, yes.
LSM-trees do really badly at multithreaded update workloads, and compaction overhead is really problematic when there isn't much update locality.
On the other hand, having most of your index be constant lets you use better data structures. Binary search over a big sorted array is really bad for cache behavior.
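To illustrate the "better data structures than binary search" point: one well-known option for an immutable index is the Eytzinger (BFS) layout, where a search walks a predictable sequence of positions instead of jumping halfway across the array each step. A minimal sketch (not bcachefs's actual layout, just an example of the technique):

```python
def eytzinger(sorted_keys):
    """Rearrange a sorted list into Eytzinger (BFS) order, 1-indexed."""
    n = len(sorted_keys)
    out = [None] * (n + 1)          # out[0] is unused padding
    it = iter(sorted_keys)
    def fill(k):                    # in-order walk of the implicit tree
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)
    fill(1)
    return out

def lower_bound(tree, key):
    """Smallest element >= key in an Eytzinger array, or None."""
    n = len(tree) - 1
    k = 1
    while k <= n:                   # branch-light descent
        k = 2 * k + (tree[k] < key)
    k >>= (~k & (k + 1)).bit_length()  # shift off trailing one-bits + 1
    return tree[k] if k else None
```

The descent only compares and doubles an index, so in a low-level implementation it's branch-predictable and prefetch-friendly, which is exactly where binary search on a plain sorted array falls down.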
For pure in-memory indexes, according to the numbers I've seen it's actually really hard to beat a pure (heavily optimized) b-tree; in memory you use a much smaller node size than on disk (I've seen 64 bytes; I'd try 256 if I were writing one).
On disk, you need a bigger node size, and then binary search becomes a problem. Even the 4k-8k that's still commonly used is much too small. You can do a lockless or mostly-lockless in-memory b-tree, but not a persistent one, so locking overhead, cache lookups, and access time on cache misses all become painful for persistent b-trees at smaller node sizes.
So the reason bcachefs's (and bcache's) btree is so fast is that we use much bigger nodes, and it's actually a hybrid compacting data structure. We get the benefits of LSM-trees (better data structures that avoid binary search for most of a lookup) without the downsides, and having the individual nodes be small, simple compacting data structures is what makes big btree nodes practical, since it amortizes the locking overhead and the access time on node traversal.
B-epsilon trees are dumb; that's just taking the downsides of both: updating interior nodes in fast paths kills multithreaded performance.
RocksDB / MyRocks is heavily used by Meta at extremely massive scale. For the sake of comparison, what's the largest real-world production deployment of bcachefs?
We're talking about database performance here, not deployment numbers. And personally, I don't much care what Meta does, they're not pushing the envelope on reliability anywhere that I know of.
Many other companies besides Meta use RocksDB; they're just the largest.
Production adoption at scale is always relevant as a measure of stability, as well as a reflection of whether a solution is applicable to general-purpose workloads.
There's more to the story than just raw performance anyway; for example, Meta's migration to MyRocks was motivated by superior compression compared to the alternatives.
But then how would they have anything to do?
> myrocks
Anything like this, but for Postgres?
Actually, is it even possible to write a new DB engine for Postgres? MySQL has InnoDB, MyISAM, etc.
Postgres's strategy has traditionally been to focus on pluggable indexing methods which can be provided by extensions, rather than completely replacing the core heap storage engine design for tables.
That said, there are a few alternative storage engines for Postgres, such as OrioleDB. However, due to limitations in Postgres's storage engine API, you need to patch Postgres to be able to use OrioleDB.
MySQL instead focused on pluggable storage engines from the get-go. That has had major pros and cons over the years. On the one hand, MyISAM is awful, so pluggable engines (specifically InnoDB) are the only thing that "saved" MySQL as the web ecosystem matured. It also nicely forced logical replication to be an early design requirement, since with a multi-engine design you need a logical abstraction instead of a physical one.
But on the other hand, pluggable storage introduces a lot of extra internal complexity, which has arguably been quite detrimental to the software's evolution. For example: which layer implements transactions, foreign keys, partitioning, and internal state (data dictionary, users/grants, replication state tracking, etc.)? Often the answer is that both the server layer and the storage engine layer would ideally need to care about these concerns, meaning a fully separated abstraction between layers isn't possible. Or think of things like transactional DDL, which is prohibitively complex in MySQL's design, so it probably won't ever happen.