It's not doing that. If you look at the repository, it's adding a new commit with tiny parquet files every 5 minutes. The most recent one was only a 20.9 KB parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it had a median size of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

The bigger concern is how large the git history is going to get on the repository.

I recall that this became a big problem for the Homebrew project in terms of load on the repo, to the extent that GitHub asked them not to recommend/default-enable shallow clones for their users: https://github.com/Homebrew/brew/issues/15497#issuecomment-1...

This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
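
For a rough sense of scale, here's a back-of-envelope sketch of that linear growth. It assumes the 5-minute cadence never changes and treats the ~5 KB median as if it were the average file size; both are my assumptions, and it ignores per-commit object overhead, compression, and however the host actually stores large files. The function names are made up for illustration.

```python
# Back-of-envelope estimate of how much data the repo accumulates over time.
# Assumptions (not measured): a fixed 5-minute commit cadence, and the ~5 KB
# median file size used as a stand-in for the mean.

COMMIT_INTERVAL_MIN = 5   # one commit every 5 minutes, per the comment above
TYPICAL_FILE_KB = 5       # median parquet size, per the comment above


def commits_per_day(interval_min: int = COMMIT_INTERVAL_MIN) -> int:
    """Number of commits in 24 hours at a fixed cadence."""
    return 24 * 60 // interval_min


def growth_kb(days: int,
              interval_min: int = COMMIT_INTERVAL_MIN,
              file_kb: float = TYPICAL_FILE_KB) -> float:
    """Total KB of new parquet data added over `days` days (linear in days)."""
    return commits_per_day(interval_min) * days * file_kb


if __name__ == "__main__":
    for days in (1, 30, 365):
        commits = commits_per_day() * days
        print(f"{days:>3} days: {commits:>7} commits, ~{growth_kb(days) / 1000:.1f} MB")
```

So on those assumptions it's 288 commits a day and roughly half a gigabyte of file data per year. The commit count (about 105k per year) is probably the more interesting number for the git history itself.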

How would a shallow clone be more stressful for GitHub than a regular clone?

Shallow clones (and the resulting lack of shared history data) break many assumptions that packfile optimisations rely on.

See also: https://github.com/orgs/Homebrew/discussions/225

This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.

Are they paying for the repo space, I wonder?

Someone's paying to keep name-dropping Iceberg(tm)