Even completely ignoring the issue of language-centric vs. data-format-centric serializers, your list is missing two very notable entries from mine: Arrow and Parquet. Both go to quite some lengths to handle optional/missing data efficiently. (I haven't personally used either one for large data sets, but I have played with them. I think you'll find that Arrow IPC / Feather (why can't they just pick one name?) has excellent performance for the actual serialization and deserialization as long as you process several rows at a time, while Parquet might win for table scans, depending on the underlying storage medium.)

Both of them are, quite specifically, the result of years of research into efficiently storing longish arrays of wide structures with potentially complex shapes and lots of missing data. (Logical arrays, that is: they're really struct-of-arrays formats. I personally have a use case I'd kind of like to use Feather for, except that Feather is not well tuned for emitting one row at a time.)