Hacker News

I think DuckLake[1] is a terrific example of this. They said "look, let's build a lakehouse over S3, but for the bit that needs strong consistency (the manifest of which S3 blobs are in play), let's use Postgres". Postgres as a metadata catalog or control plane is brilliant for this, since you get strong consistency, and a metadata catalog has a very different scaling profile from the bulk data it describes. Use S3 for volume, Postgres for consistent metadata.
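To make the "manifest in a transactional database" idea concrete, here is a minimal sketch. It assumes `sqlite3` as a stand-in for Postgres so it runs anywhere, and the table and function names (`snapshot`, `data_file`, `commit_snapshot`, `current_files`) are made up for illustration; this is not DuckLake's actual schema. The point is only that a set of blob paths becomes visible in one transaction, or not at all.

```python
import sqlite3


def open_catalog():
    # sqlite3 is an assumption standing in for Postgres; schema is hypothetical.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY)")
    db.execute(
        "CREATE TABLE data_file ("
        " snapshot_id INTEGER NOT NULL REFERENCES snapshot(id),"
        " path TEXT NOT NULL)"
    )
    return db


def commit_snapshot(db, paths):
    """Register already-uploaded blobs as one new snapshot, atomically."""
    with db:  # one transaction: commit on success, roll back on any error
        snap = db.execute("INSERT INTO snapshot DEFAULT VALUES").lastrowid
        db.executemany(
            "INSERT INTO data_file VALUES (?, ?)",
            [(snap, p) for p in paths],
        )
    return snap


def current_files(db):
    """Return the blob paths listed by the most recent snapshot."""
    latest = db.execute("SELECT MAX(id) FROM snapshot").fetchone()[0]
    if latest is None:
        return []
    rows = db.execute(
        "SELECT path FROM data_file WHERE snapshot_id = ? ORDER BY path",
        (latest,),
    )
    return [path for (path,) in rows]
```

If any insert in `commit_snapshot` fails, the snapshot row is rolled back too, so readers calling `current_files` keep seeing the previous manifest.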

A similar pattern has spilled out of projects like WarpStream[2], which I suspect uses Postgres behind the scenes in its control plane.

[1]: https://ducklake.select

[2]: https://www.warpstream.com/

munk-a a day ago

I built and maintain a system that uses a very similar approach: we register artifacts under UUIDs in S3 with a strictly write-once, never-edit, never-remove policy, and then store those UUIDs in Postgres. We simply repoint other model objects at UUIDs as needed, which lets us keep our safety guarantees without burdening the centralized system with the massive volume (these artifacts are often 50MB+ PDFs). I'm quite fond of this approach, but it's good to be aware that introducing levels of abstraction like this does necessarily widen some failure points on the storage side: if your service uses multiple persistence stores, each additional store exposes yet another point where inconsistency could be introduced and/or a message could be lost. Still, fragmenting your data over multiple stores that are particularly well suited to their specialized usages can be huge for performance and cost.
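A rough sketch of that write-once pattern, under loud assumptions: an in-memory dict stands in for S3, `sqlite3` stands in for Postgres, and every name here (`WriteOnceBlobs`, `store_artifact`, `reassign`, `find_orphans`) is hypothetical rather than taken from the system described. It writes the blob first and the metadata row second, so a crash between the two steps leaves an unreferenced blob rather than a dangling reference, and a sweep can find those orphans.

```python
import sqlite3
import uuid


class WriteOnceBlobs:
    """In-memory stand-in for an S3 bucket that refuses overwrites."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        if key in self._objects:
            raise KeyError(f"{key} already exists; blobs are write-once")
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def keys(self):
        return set(self._objects)


def open_registry():
    # sqlite3 is an assumption standing in for Postgres.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE artifact (id TEXT PRIMARY KEY, owner TEXT NOT NULL)")
    return db


def store_artifact(blobs, db, owner, data):
    artifact_id = str(uuid.uuid4())
    blobs.put(artifact_id, data)  # step 1: bulk bytes go to the blob store
    with db:  # step 2: the small, consistent record goes to the database
        db.execute("INSERT INTO artifact VALUES (?, ?)", (artifact_id, owner))
    return artifact_id


def reassign(db, artifact_id, new_owner):
    """Repoint a model object at an artifact; the blob itself never changes."""
    with db:
        db.execute(
            "UPDATE artifact SET owner = ? WHERE id = ?", (new_owner, artifact_id)
        )


def find_orphans(blobs, db):
    """Blobs with no metadata row, e.g. left by a crash between steps 1 and 2."""
    known = {row[0] for row in db.execute("SELECT id FROM artifact")}
    return blobs.keys() - known
```

The ordering is a design choice, not the only option: writing the row first would instead risk metadata that points at a blob that never arrived, which is usually the worse failure for readers.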

akoboldfrying 19 hours ago

If you use hashes of the content itself as your IDs instead of UUIDs, you'll (a) get deduplication and data consistency checking for free and (b) have basically implemented (a subset of) git that uses S3 backing instead of a local filesystem directory :)
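For illustration, a minimal content-addressed store might look like the sketch below. The assumptions: a dict stands in for the S3 bucket, SHA-256 is my choice of hash (it is not meant to reproduce git's object IDs), and the class name `ContentAddressedStore` is hypothetical.

```python
import hashlib


class ContentAddressedStore:
    """Blobs keyed by the SHA-256 of their bytes; a dict stands in for S3."""

    def __init__(self):
        self._objects = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so a repeat upload is a no-op.
        self._objects.setdefault(key, bytes(data))
        return key

    def get(self, key):
        data = self._objects[key]
        # Re-hashing on read detects corruption or tampering.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"blob {key} failed its integrity check")
        return data

    def __len__(self):
        return len(self._objects)
```

Storing the same PDF twice yields the same key and a single stored object, and `get` raises if the stored bytes no longer match the key they were filed under.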