The performance improvements are impressive:

> In Automerge 3.0, we've rearchitected the library so that it also uses the compressed representation at runtime. This has achieved huge memory savings. For example, pasting Moby Dick into an Automerge 2 document consumes 700Mb of memory, in Automerge 3 it only consumes 1.3Mb!

> Finally, for documents with large histories load times can be much much faster (we recently had an example of a document which hadn't loaded after 17 hours loading in 9 seconds!).

I wonder if this is accomplished using controlled buffers in AsyncIterators. I recently built a tool for processing massive CSV files and was able to get the memory usage remarkably low, and to control and scale it almost linearly because of how the workers (async iterators) are spawned and their workloads are managed. It kind of blew me away that I could get the sort of fine-tuned control I'd normally expect from Go or Rust (I'm using Deno for this project).
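
To make the idea concrete, here's a rough sketch of the pattern I mean (nothing to do with Automerge's actual Rust internals; the file name, batch size, and handleBatch are made up for illustration):

```ts
// Rough sketch only: pull CSV rows through an async iterator so memory
// scales with the batch size, not the file size. The naive split() below
// ignores quoted fields; a real parser would handle that.
import { TextLineStream } from "jsr:@std/streams";

async function* csvRows(path: string): AsyncGenerator<string[]> {
  const file = await Deno.open(path, { read: true });
  // Fully consuming file.readable closes the file for us.
  const lines = file.readable
    .pipeThrough(new TextDecoderStream())
    .pipeThrough(new TextLineStream());
  for await (const line of lines) {
    if (line.length > 0) yield line.split(",");
  }
}

async function handleBatch(batch: string[][]): Promise<void> {
  // Stand-in for the real validate/transform/store work.
  console.log(`processed ${batch.length} rows`);
}

// Rows are pulled on demand, so only one batch is ever held in memory.
export async function processCsv(path: string, batchSize = 1_000) {
  let batch: string[][] = [];
  for await (const row of csvRows(path)) {
    batch.push(row);
    if (batch.length >= batchSize) {
      await handleBatch(batch);
      batch = [];
    }
  }
  if (batch.length > 0) await handleBatch(batch);
}
```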

I'm well above 1.3 MB, and although I could get it down there, performance would suffer. I'm curious how fast they sync this data with such tiny memory usage. And if the resources were available, was Automerge 2, despite its 700 MB of memory, still faster?

These people are definitely smarter than I am, so maybe their solution is a lot more clever than what I'm doing.

edit: Oh, they did this part in Rust. I thought it was written in JS. I still wonder: how'd they get memory usage this low, and did it impact speed much? I'll have to dig into it.

> I recently built a tool for processing massive CSV files and was able to get the memory usage remarkably low

Is it OSS? I'd like to benchmark it against my CSV parser :)

No, it's very specific to some watershed sensing data that comes from a bunch of devices strewn about the coast of British Columbia. I'd love to make it (and most of the work I do) OSS if only to share with other scientific groups doing similar work.

Your parser is almost certainly better and faster :) Mine is tailored to a specific schema with particular expectations about foreign keys (well, the concept of them, enforced artificially) across the documents. This is actually why I've been thinking about using DuckDB for this project; it would let me pack the data into the database under multiple schemas with real keys and some primitive type-level constraints. Analysis after that would be sooo much cleaner and faster.
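
For a rough idea of what I mean by real keys, something like the sketch below, with invented schema/table/column names and assuming the classic npm:duckdb bindings (which I haven't actually run under Deno yet):

```ts
// Sketch only: multiple schemas plus a database-enforced foreign key,
// instead of the hand-rolled checks my parser does today.
import duckdb from "npm:duckdb";

const db = new duckdb.Database("watershed.db");

db.exec(
  `
  CREATE SCHEMA IF NOT EXISTS raw;
  CREATE SCHEMA IF NOT EXISTS curated;

  CREATE TABLE IF NOT EXISTS curated.events (
    event_id   INTEGER PRIMARY KEY,
    device_id  VARCHAR NOT NULL,
    started_at TIMESTAMP NOT NULL
  );

  -- The foreign key is enforced by the database, not by the parser.
  CREATE TABLE IF NOT EXISTS curated.readings (
    reading_id INTEGER PRIMARY KEY,
    event_id   INTEGER NOT NULL REFERENCES curated.events (event_id),
    value      DOUBLE NOT NULL
  );
  `,
  (err) => {
    if (err) console.error(err);
    else console.log("schemas and constraints created");
  },
);
```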

The parsing itself is done with the Streams API and orchestrated by a state chart (XState). The memory management and concurrency of the whole system are really nice and I'm happy with them, but I'm probably making tons of mistakes and trading program efficiency for developer comforts here and there.

The state chart essentially does some grouping operations to pull event data from multiple CSVs; once it has those events, it stitches them together into smaller portions and ensures the tables map to one another by the event's ID. It's nice because the grouping happens over one enormous file, and it carves out these groups for the state chart to then organize, validate, and store in parallel. You can configure how much it'll do in parallel, but only because we've got some funny practices here and it's a safety precaution to prevent tying up too many resources on a massive kitchen-sink server on AWS. Haha. So, lots of non-parsing-specific design considerations are baked in.
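
Setting XState aside, the parallelism cap itself boils down to something like this (names invented; it's just the concurrency-limiting idea, not the real chart):

```ts
// Plain-TypeScript sketch of the configurable parallelism cap. `Group` and
// organizeValidateStore stand in for whatever the chart actually invokes
// per carved-out group of rows.
type Group = { eventId: string; rows: string[][] };

async function organizeValidateStore(group: Group): Promise<void> {
  // Stand-in for the per-group organize/validate/store work.
  console.log(`stored ${group.rows.length} rows for event ${group.eventId}`);
}

// Pulls groups from an async source and keeps at most `concurrency` of them
// in flight, so resource usage on the shared server stays bounded.
export async function processGroups(
  groups: AsyncIterable<Group>,
  concurrency = 4,
) {
  const inFlight = new Set<Promise<void>>();
  for await (const group of groups) {
    const task = organizeValidateStore(group).finally(() =>
      inFlight.delete(task)
    );
    inFlight.add(task);
    if (inFlight.size >= concurrency) await Promise.race(inFlight);
  }
  await Promise.all(inFlight);
}
```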

One day I'll shift this off the giga-server and let it run in isolation with whatever resources it needs, but for now it's baby steps and compromises.

thanks!

They say: "In Automerge 3.0, we've rearchitected the library so that it also uses the compressed representation at runtime. This has achieved huge memory savings."

Right, this didn't click at first, but now I understand. I can actually gain similar benefits in my project by switching to storing the data as Parquet/DuckDB files; I had no idea the potential gains from compressed representations could be so significant, which is why I'd been holding off on testing that out. Thanks for the nudge on that detail!
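
What I have in mind is roughly DuckDB's COPY ... (FORMAT parquet), something like the sketch below (same assumed npm:duckdb bindings as before, with invented file and table names):

```ts
// Sketch only: land the validated rows in DuckDB, then write them out as
// Parquet, which is columnar and compressed, so the on-disk and scan-time
// footprint is far smaller than the raw CSVs.
import duckdb from "npm:duckdb";

const db = new duckdb.Database("watershed.db");

db.exec(
  `
  COPY (SELECT * FROM curated.readings)
    TO 'readings.parquet' (FORMAT parquet, COMPRESSION zstd);
  `,
  (err) => {
    if (err) console.error(err);
    else console.log("wrote readings.parquet");
  },
);
```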