Surprised how little comment this post has, this is an insane improvement.
I've been using Electric SQL, but Automerge 3.0 seems like the holy grail, combining a local-first approach with CRDTs?
Wondering if I should ditch Electric SQL and switch to this instead. I'm just not sure what kind of hardware I need to run a sync server for Automerge, or how many users and reads/writes it can support.
ElectricSQL is pretty good too, but it's still not quite there, and implementing local-first means some features related to rollback are harder to apply.
I'm still very new to this overall, but that 10x memory boost is welcome, as I find the lag used to be very noticeable with very large documents.
It really depends on your use case. If you want people collaborating on a rich text document, Automerge or yjs are probably great.
If you want to have local first application data where a server is the authority, ElectricSQL is probably going to serve you best.
That said, there are so many approaches out there right now, and they're all promising but tricky.
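To make the Automerge/CRDT option concrete, here's roughly what concurrent offline edits and a merge look like with the Automerge JS API (a minimal sketch; the document shape and field names are made up):

    import * as Automerge from "@automerge/automerge";

    // Two peers fork the same document...
    const base = Automerge.from({ title: "", tags: [] as string[] });
    let alice = Automerge.clone(base);
    let bob = Automerge.clone(base);

    // ...each makes independent edits, possibly offline...
    alice = Automerge.change(alice, (d) => { d.title = "Field notes"; });
    bob = Automerge.change(bob, (d) => { d.tags.push("urgent"); });

    // ...and the merge needs no server to decide whose change wins.
    const merged = Automerge.merge(alice, bob);
    console.log(merged.title, merged.tags); // "Field notes" [ "urgent" ]

With ElectricSQL you get the server-as-authority model instead, which is exactly the trade-off described above.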
The use case is a voice note aggregation system, the notes are stored on S3 and cached locally to desktops and mobile applications. There are transcriptions, AI summaries, user annotations, and structured metadata associated with each voice note. The application will be used by a single human, but he might not always remember to sync or even have an internet connection when he wants to.
Thank you!
If you're building your app for yourself, you likely don't need CRDTs at all.
I don't know much about automerge or other local-first solutions, but a local-first solution that doesn't deal with CRDTs is likely a much better fit for you.
Thank you. I meant that every user will only be interacting with his own files. But yes, there already are, and will be, additional users with their own files.
The performance improvements are impressive:
> In Automerge 3.0, we've rearchitected the library so that it also uses the compressed representation at runtime. This has achieved huge memory savings. For example, pasting Moby Dick into an Automerge 2 document consumes 700Mb of memory, in Automerge 3 it only consumes 1.3Mb!
> Finally, for documents with large histories load times can be much much faster (we recently had an example of a document which hadn't loaded after 17 hours loading in 9 seconds!).
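As far as I understand it, the compressed representation they mention is the same column-oriented encoding that Automerge.save() already produces; the 3.0 change is that the in-memory document now stays close to that shape instead of expanding every operation into its own object. Roughly (untested sketch):

    import * as Automerge from "@automerge/automerge";

    let doc = Automerge.from({ text: "" });
    doc = Automerge.change(doc, (d) => { d.text = "Call me Ishmael..."; });

    // save() emits the compact columnar encoding; per the post, 3.0 keeps
    // the runtime representation much closer to this than 2.x did.
    const bytes: Uint8Array = Automerge.save(doc);
    const reloaded = Automerge.load(bytes);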
I wonder if this is accomplished using controlled buffers in AsyncIterators. I recently built a tool for processing massive CSV files and was able to get the memory usage remarkably low, and control/scale it almost linearly because of how the workers (async iterators) are spawned and their workloads are managed. It kind of blew me away that I could get such fine-tuned control that I'd normally expect from Go or Rust (I'm using Deno for this project).
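The pattern I mean looks roughly like this: the iterator only pulls the next row when a worker slot frees up, so memory is bounded by the pool size rather than the file size (a simplified sketch, not my actual code):

    // Process an async stream of CSV rows with at most `limit` handlers in
    // flight; backpressure comes from pausing the pull loop when full.
    async function processWithLimit<T>(
      rows: AsyncIterable<T>,
      limit: number,
      handle: (row: T) => Promise<void>,
    ): Promise<void> {
      const inFlight = new Set<Promise<void>>();
      for await (const row of rows) {
        const p = handle(row).finally(() => inFlight.delete(p));
        inFlight.add(p);
        if (inFlight.size >= limit) {
          await Promise.race(inFlight); // wait for a free slot before pulling more
        }
      }
      await Promise.all(inFlight); // drain the remaining work
    }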
I'm well above 1.3mb, and although I could get it down there, performance would suffer. I'm curious how fast they sync this data with such tiny memory usage. If the resources were available before, despite using 700mb of memory, was it still faster?
These people are definitely smarter than I am, so maybe their solution is a lot more clever than what I'm doing.
edit: Oh, they did this part with Rust. I thought it was written in JS. I still wonder: how'd they get memory usage this low, and did it impact speed much? I'll have to dig into it.
> I recently built a tool for processing massive CSV files and was able to get the memory usage remarkably low
is it OSS? i'd like to benchmark it against my csv parser :)
No, it's very specific to some watershed sensing data that comes from a bunch of devices strewn about the coast of British Columbia. I'd love to make it (and most of the work I do) OSS if only to share with other scientific groups doing similar work.
Your parser is almost certainly better and faster :) Mine is tailored to a certain schema with specific expectations about foreign keys (well, the concept and artificial enforcement of them) across the documents. This is actually why I've been thinking about using duckdb for this project; it'll allow me to pack the data into the db under multiple schemas with real keys and some primitive type-level constraints. Analysis after that would be sooo much cleaner and faster.
The parsing itself is done with the streams API and orchestrated by a state chart (XState), and while the memory management and concurrency of the whole system is really nice and I'm happy with it, I'm probably making tons of mistakes and trading program efficiency for developer comforts here and there.
The state chart essentially does some grouping operations to pull event data from multiple CSVs, then once it has those events, it stitches them together into smaller portions and ensures each table maps to the others by the event's ID. It's nice because grouping occurs from one enormous file, and it carves out these groups for the state chart to then organize, validate, and store in parallel. You can configure how much it'll do in parallel, but only because we've got some funny practices here and it's a safety precaution to prevent tying up too many resources on a massive kitchen-sink server on AWS. Haha. So, lots of non-parsing-specific design considerations are baked in.
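In rough pseudo-TypeScript, the stitching step amounts to something like this (heavily simplified; the names are invented, and the real thing runs inside the state chart and the streams API):

    // Rows from several CSV-derived tables are grouped by a shared event ID,
    // so each group can be validated and stored as one unit.
    type Row = { eventId: string; [column: string]: string };

    function groupByEvent(
      tables: Record<string, Row[]>,
    ): Map<string, Record<string, Row[]>> {
      const events = new Map<string, Record<string, Row[]>>();
      for (const [table, rows] of Object.entries(tables)) {
        for (const row of rows) {
          const group = events.get(row.eventId) ?? {};
          (group[table] ??= []).push(row); // the "foreign key" is enforced by construction
          events.set(row.eventId, group);
        }
      }
      return events;
    }

Each group then goes through the organize/validate/store steps in parallel, capped by that configurable limit.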
One day I'll shift this off the giga-server and let it run in isolation with whatever resources it needs, but for now it's baby steps and compromises.
thanks!
They say: "In Automerge 3.0, we've rearchitected the library so that it also uses the compressed representation at runtime. This has achieved huge memory savings."
Right, this didn't click at first, but now I understand. I can actually gain similar benefits in my project by switching to storing the data as parquet/duckdb files; I had no idea the potential gains from compressed representations could be so significant, so I'd been holding off on testing that out. Thanks for the nudge on that detail!
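Concretely, the experiment I have in mind is letting DuckDB ingest the raw CSV and write columnar Parquet, then querying that directly. Something like the following, using the duckdb Node bindings from memory (untested, and I haven't checked how the native module behaves under Deno; file names are placeholders):

    import duckdb from "npm:duckdb";

    const db = new duckdb.Database("watershed.duckdb"); // placeholder name
    db.run(`
      CREATE TABLE events AS
      SELECT * FROM read_csv_auto('sensor_dump.csv')    -- placeholder input
    `);
    db.run(`COPY events TO 'events.parquet' (FORMAT PARQUET)`);

    // Parquet files can then be queried in place, without re-ingesting.
    db.all(`SELECT count(*) AS n FROM 'events.parquet'`, (err, rows) => {
      if (err) throw err;
      console.log(rows);
    });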
Probably because I still don't understand what this thing does, exactly (and I didn't start doing tech yesterday).
A high upvote/comment ratio is a sign of a quality post, honestly. Sometimes all you can do is upvote.