Looks interesting for something like local development. I don't intend to run production object storage myself, but some of the stuff in the guide to the production setup (https://garagehq.deuxfleurs.fr/documentation/cookbook/real-w...) would scare me a bit:
> For the metadata storage, Garage does not do checksumming and integrity verification on its own, so it is better to use a robust filesystem such as BTRFS or ZFS. Users have reported that when using the LMDB database engine (the default), database files have a tendency of becoming corrupted after an unclean shutdown (e.g. a power outage), so you should take regular snapshots to be able to recover from such a situation.
It seems like you can also use SQLite, but a default database that isn't robust against power failure or crashes seems surprising to me.
If you know of an embedded key-value store that supports transactions, is fast, has good Rust bindings, and does checksumming/integrity verification by default such that it almost never corrupts upon power loss (or at least, is always able to recover to a valid state), please tell me, and we will integrate it into Garage immediately.
Sounds like a perfect fit for https://slatedb.io/ -- it's just that (an embedded Rust KV store that supports transactions).
It's built specifically to run on object storage. It currently relies on the `object_store` crate, but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3-compatible) it should just work OOTB.
For Garage's particular use case, I think SlateDB's "backed by object storage" would be an anti-feature. Their usage of LMDB/SQLite is for the metadata of the object store itself; trying to host that metadata within the object store runs into a circular dependency problem.
I’ve used RocksDB for this kind of thing in the past with good results. It’s very thorough from a data corruption detection/rollback perspective (this is naturally much easier to get right with LSMs than B+ trees). The Rust bindings are fine.
It’s worth noting too that B+ tree databases are not a fantastic match for ZFS - they usually require extra tuning (block sizes, other stuff like how WAL commits work) to get performance comparable to XFS/ext4. LSMs on the other hand naturally fit ZFS’s CoW internals like a glove.
I don't really know enough about the specifics here. But my main point isn't about checksums; it's more about something like the WAL in Postgres. For an embedded KV store this is probably not the solution, but my understanding is that there are data structures like LSM trees that would provide similar robustness. But I don't actually understand this topic well enough.
Checksumming detects corruption after it has happened. A database like Postgres will simply notice it was not cleanly shut down and put the DB into a consistent state by replaying the write-ahead log on startup (roughly the pattern sketched below). So that is kind of my default expectation for any DB that handles data that isn't ephemeral or easily regenerated.
But I also likely have the wrong mental model of what Garage does with the metadata, as I wouldn't have expected that to ever be limited by SQLite.
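To make the recovery idea concrete, here is a minimal, hypothetical sketch of the WAL pattern described above (not Garage's or Postgres's actual implementation): each record carries a length and a checksum and is fsynced at commit, and replay on startup stops at the first torn or corrupt record, which leaves the store in a consistent state after an unclean shutdown.

```rust
use std::fs::{File, OpenOptions};
use std::io::{Read, Write};

// Hypothetical record layout: [len: u32 LE][checksum of payload: u32 LE][payload]
fn append(log: &mut File, payload: &[u8]) -> std::io::Result<()> {
    let mut rec = Vec::with_capacity(8 + payload.len());
    rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    rec.extend_from_slice(&checksum(payload).to_le_bytes());
    rec.extend_from_slice(payload);
    log.write_all(&rec)?;
    log.sync_all() // commit point: the record counts as durable only after fsync succeeds
}

// On startup, replay records until the first truncated or corrupt one, then stop.
// Everything before that point is applied; the torn tail is discarded, which is
// what puts the store back into a consistent state after an unclean shutdown.
fn replay(path: &str, mut apply: impl FnMut(&[u8])) -> std::io::Result<()> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    let mut off = 0;
    while off + 8 <= buf.len() {
        let len = u32::from_le_bytes(buf[off..off + 4].try_into().unwrap()) as usize;
        let crc = u32::from_le_bytes(buf[off + 4..off + 8].try_into().unwrap());
        let end = off + 8 + len;
        if end > buf.len() || checksum(&buf[off + 8..end]) != crc {
            break; // torn or corrupt tail record: ignore it
        }
        apply(&buf[off + 8..end]);
        off = end;
    }
    Ok(())
}

// Toy stand-in checksum so the sketch has no external dependencies;
// a real WAL would use CRC32 or similar.
fn checksum(data: &[u8]) -> u32 {
    data.iter().fold(0u32, |acc, &b| acc.rotate_left(5) ^ u32::from(b))
}

fn main() -> std::io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open("wal.log")?;
    append(&mut log, b"put k1=v1")?;
    replay("wal.log", |payload| println!("replayed: {payload:?}"))
}
```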
So the thing is, different KV stores have different trade-offs, and for now we haven't yet found one that has the best of all worlds.
We do recommend SQLite in our quick-start guide for a single-node deployment with small/moderate workloads, and it works fine. The "real world deployment" guide recommends LMDB because it gives much better performance (with the current state of Garage; not to say that this couldn't be improved), and the risk of critical data loss is mitigated by the fact that such a deployment would use multi-node replication: if one node is corrupted and no snapshot is available, the data can always be recovered from another replica. Maybe this should be worded better; I can see that the alarmist wording of the deployment guide is creating quite a debate, so we probably need to make these facts clearer.
We are also experimenting with Fjall as an alternative KV engine based on LSM trees; it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it lives up to these expectations.
(genuinely asking) why not SQLite by default?
We were not able to get good enough performance compared to LMDB. We will work on this more though, there are probably many ways performance can be increased by reducing load on the KV store.
Did you try WITHOUT ROWID? Your SQLite implementation[1] uses a BLOB primary key. In SQLite, this means each operation requires two B-tree traversals: the BLOB->rowid tree and the rowid->data tree.
If you use WITHOUT ROWID, you traverse only the BLOB->data tree (see the sketch below).
Looking up lexicographically similar keys gets a huge performance boost, since SQLite can scan a B-tree node and the data is contiguous. Your current implementation is chasing pointers to random locations in a different B-tree.
I'm not sure whether the on-disk size would get smaller or larger; it probably depends on the key and value sizes compared to the 64-bit rowids. This is probably a well-studied question you could find the answer to.
[1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8...
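For anyone curious what the suggestion looks like in practice, here is a minimal sketch using the rusqlite crate with a made-up `kv` table (not Garage's actual schema). The only change is the WITHOUT ROWID clause, which makes the BLOB key the clustering key of the single B-tree:

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open_in_memory()?;

    // Default rowid table: the BLOB primary key lives in a separate index
    // B-tree (key -> rowid), and the data lives in the rowid -> data tree.
    conn.execute_batch(
        "CREATE TABLE kv_rowid (k BLOB PRIMARY KEY, v BLOB NOT NULL);",
    )?;

    // WITHOUT ROWID table: a single B-tree keyed directly on the BLOB key,
    // so lookups and range scans touch one tree and neighbouring keys are contiguous.
    conn.execute_batch(
        "CREATE TABLE kv (k BLOB PRIMARY KEY, v BLOB NOT NULL) WITHOUT ROWID;",
    )?;

    conn.execute(
        "INSERT INTO kv (k, v) VALUES (?1, ?2)",
        (&b"bucket/object/part0"[..], &b"metadata bytes"[..]),
    )?;
    let v: Vec<u8> = conn.query_row(
        "SELECT v FROM kv WHERE k = ?1",
        [&b"bucket/object/part0"[..]],
        |row| row.get(0),
    )?;
    assert_eq!(v, b"metadata bytes");
    Ok(())
}
```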
Very interesting, thank you. It would probably make sense for most tables but not all of them because some are holding large CRDT values.
Other than knowing this about SQLite beforehand, is there any way one could discover that this is happening through tracing?
Keep in mind that write safety comes with performance penalties. You can turn off write protections and many databases will be super fast, but easily corrupt.
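As a concrete illustration of that trade-off, SQLite's synchronous pragma is exactly such a knob. A small sketch using rusqlite with a hypothetical database file (not Garage's actual configuration):

```rust
use rusqlite::Connection;

fn main() -> rusqlite::Result<()> {
    let conn = Connection::open("example.db")?; // hypothetical path

    // Durable but slower: SQLite fsyncs at the critical moments of each commit,
    // so after a power cut the database rolls back to the last committed transaction.
    conn.execute_batch("PRAGMA synchronous = FULL;")?;

    // Fast but unsafe: no fsyncs at all. Benchmarks look great, but an OS crash
    // or power loss at the wrong moment can leave the database file corrupted.
    // conn.execute_batch("PRAGMA synchronous = OFF;")?;

    conn.execute_batch("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v BLOB);")?;
    conn.execute("INSERT INTO t (v) VALUES (?1)", [&b"payload"[..]])?;
    Ok(())
}
```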
I learned that Turso apparently has plans for a rewrite of libsql [0] in Rust, creating a more 'hackable' SQLite alternative altogether. It was apparently discussed in this Developer Voices [1] video, which I haven't watched yet.
[0] https://github.com/tursodatabase/libsql
[1] https://www.youtube.com/watch?v=1JHOY0zqNBY
Could you use something like Fly's Corrosion to shard and distribute the SQLite data? It uses CRDT reconciliation, which is familiar territory for Garage.
Garage already shards data by itself if you add more nodes, and it is indeed a viable path to increasing throughput.
I've not looked at it in a while, but sled/rio were interesting up-and-coming options: https://github.com/spacejam/sled
Fjall
https://github.com/fjall-rs/fjall
RocksDB possibly. Used in high throughput systems like Ceph OSDs.
Valkey?
It's "key/value store", FYI
It's not a store of "keys or values", no. It's a store of key-value pairs.
A key-value store would be a store of one thing: key values. A hyphen combines two words to make an adjective, which describes the word that follows.
When you have two exclusive options, two sides to a situation, or separate things, you separate them with a slash. Therefore a key-value store and a key/value store are quite different. All of your slash examples represent either–or situations. A switch turns it on or off, the situation is a win in the first outcome or a win in the second outcome, etc.
It's true that key–value store shouldn't be written with a hyphen. It should be written with an en dash, which is used "to contrast values or illustrate a relationship between two things [... e.g.] Mother–daughter relationship"
https://en.wikipedia.org/wiki/Dash#En_dash
I just didn't want to bother with typography at that level of pedantry.
No, they don't. A master/slave configuration (of hard drives, for example) involves two things. I specifically included it to head off the exact objection you're raising.
"...the slash is now used to represent division and fractions, as a date separator, in between multiple alternative or related terms"
-Wikipedia
And what is a key/value store? A store of related terms.
And if you had a system that only allowed a finite collection of key values, where might you put them? A key-value store.
The hard drives are either master or slave. A hard drive is not a master-and-slave.
Exactly. And an entry in a key/value store is either a key or a value. Not both.
No, an entry is a key-and-value pair. Are you seriously suggesting it is possible to add only keys without corresponding values, or vice versa?
Wikipedia seems to find "key-value store" an appropriate term.
https://en.wikipedia.org/wiki/Key%E2%80%93value_database
See above.
Which is infinite if the value is zero.
Depending on the underlying storage being reliable is far from unique to Garage. This is what most other services do too, unless we're talking about something like Ceph, which manages the physical storage itself.
Standard filesystems such as ext4 and XFS don't have data checksumming, so you'll have to rely on another layer to provide integrity. Regardless, that's not Garage's job IMO. It's good that they're keeping their design simple and focusing their resources on implementing the S3 spec.
That's not something you can do reliably in software; datacenter-grade NVMe drives come with power-loss protection and additional capacitors to handle that gracefully. Otherwise, if power is cut at the wrong moment, the partition may not even be mountable afterwards.
If you really live somewhere with frequent outages, buy an industrial drive that has a PLP rating. Or get a UPS, they tend to be cheaper.
Isn't that the entire point of write-ahead logs, journaling file systems, and fsync in general? A roll-back or roll-forward due to a power loss causing a partial write is completely expected, but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?
As I understood it, the capacitors on datacenter-grade drives are to give it more flexibility, as it allows the drive to issue a successful write response for cached data: the capacitor guarantees that even with a power loss the write will still finish, so for all intents and purposes it has been persisted, so an fsync can return without having to wait on the actual flash itself, which greatly increases performance. Have I just completely misunderstood this?
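For reference, this is all the application side can do (a minimal Rust sketch with a hypothetical file name): write, then fsync, and only then treat the data as persisted. The capacitor question above is about whether the drive honours the cache flush that the kernel issues on fsync, or can acknowledge early because power-loss protection guarantees the cached write will still complete:

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append a record and only report success once fsync has returned, i.e. once
// the kernel claims the data (and metadata) has reached stable storage.
// Whether that claim survives a power cut depends on the drive honouring the
// flush, either by writing through to flash/platters or by having power-loss
// protection for its volatile cache.
fn append_durably(path: &str, record: &[u8]) -> std::io::Result<()> {
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    f.write_all(record)?;
    f.sync_all() // fsync(2)
}

fn main() -> std::io::Result<()> {
    append_durably("journal.log", b"commit\n")
}
```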
You actually don't need capacitors for rotating media: Western Digital has a feature called "ArmorCache" that uses the rotational energy in the platters to power the drive long enough to sync the volatile cache to non-volatile storage.
https://documents.westerndigital.com/content/dam/doc-library...
Very cool, like the ram air turbine that deploys on aircraft in the event of a power loss.
God, I love engineers.
> but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted?
That doesn't even help if fsync() doesn't do what developers expect: https://danluu.com/fsyncgate/
I think this was the blog post that had a bunch more stuff that can go wrong too: https://danluu.com/deconstruct-files/
But basically fsync itself (sometimes) has dubious behaviour, the OS and filesystem layers on top handle it dubiously, and then even on top of that most databases can ignore fsync errors (and lie that the data was written properly).
So... yes.
> ignore fsync and blatantly lie that the data has been persisted
Unfortunately they do: https://news.ycombinator.com/item?id=38371307
If the drives continue to have power, but the OS has crashed, will the drives persist the data once a certain amount of time has passed? Are datacenters set up to take advantage of this?
> will the drives persist the data once a certain amount of time has passed
Yes, otherwise those drives wouldn't work at all and would have a 100% warranty return rate. The reason they get away with it is that the misbehavior is only a problem in a specific edge-case (forgetting data written shortly before a power loss).
Yes, the drives are unaware of the OS state.
I've been using MinIO for local dev, but that version is unmaintained now. However, I was put off by the minimum requirements for Garage listed on the page -- does it really need a gig of RAM?
I always understood this requirement as "Garage will run fine on hardware with 1GB RAM total" - meaning the 1GB includes the RAM used by the OS and other processes. I think that most current consumer hardware that is a potential Garage host, even on the low end, has at least 1GB total RAM.
The current latest Minio release that is working for us for local development is now almost a year old and soon enough we will have to upgrade. Curious what others have replaced it with that is as easy to set up and has a management UI.
I think that's part of the pitch here... swapping out Minio for Garage. Both scale a lot more than for just local development, but local dev certainly seems like a good use-case here.
It does not, at least not for a small local dev server. I believe RAM usage should be around 50-100MB, increasing if you have many requests with large objects.
The assumption is nodes are in different fault domains so it'd be highly unlikely to ruin the whole cluster.
LMDB mode also runs with flush/syncing disabled