I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog etc.)

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They have inserted arbitrary data between footer and data pages, which other readers will ignore. But query engines like DataFusion can exploit it. They embed a new index to the .parquet file, and use that to improve query performance.

In this specific instance, they add an index with all the distinct values of a column. Then they extend the DataFusion query engine to exploit that so that queries like `WHERE nation = 'Singapore'` can use that index to figure out whether the value exists in that .parquet file without having to scan the data pages (which is already optimized because there is a min-max filter to avoid scanning the entire dataset).

Also in general this is a really good deep dive into columnar data storage.

What are the new file format initiatives you're referencing here?

This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.

Off the top of my head:

- Vortex https://github.com/vortex-data/vortex

- Lance https://github.com/lancedb/lance

- Nimble https://github.com/facebookincubator/nimble

There are also a bunch of ideas coming out of academia, but I don't know how many of them have a sustained effort behind them and not just a couple of papers

Lance (from LanceDB folks), Nimble (from Meta folks, formerly known as Alpha); I think there are a few others

https://github.com/lancedb/lance

https://github.com/facebookincubator/nimble

Yeah I'm happy to see this, we have been curious as part of figuring out cloud native storage extensions to GFQL (graph dataframe-native query lang), and my intuition was parquet was pluggable here... And this is the first I'm seeing a cogent writeup.

Likewise, this means, afaict, it's likewise pretty straightforward to do novel indexing schemes within Iceberg as well just by reusing this.

The other aspect I've been curious about is the happy path pluggable types for custom columns. This shows one way, but I'm unclear if same thing.

nice summary!