It's not doing that. If you look at the repository, it's adding a new commit with tiny parquet files every 5 minutes. This recent one was only a 20.9 KB parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it were a median of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...
The bigger concern is how large the git history is going to get on the repository.
I recall that this became a big problem for the Homebrew project in terms of load on the repo, to the extent that Github asked them not to recommend/default-enable shallow clones for their users: https://github.com/Homebrew/brew/issues/15497#issuecomment-1...
This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
How would shallow clone be more stressful for GitHub than a regular clone?
Shallow clones (and the resulting lack of shared history data) break many assumptions that packfile optimisations rely on.
See also: https://github.com/orgs/Homebrew/discussions/225
This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.
Are they paying for the repo space, I wonder?
Someone's paying to keep name-dropping Iceberg(tm).
"The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory."
So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.
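The layout described in the quote can be sketched in a few lines. This is a minimal stdlib-only illustration, not the actual pipeline: plain lists of dicts stand in for Parquet blocks, a dict stands in for the repository, and all names here are mine.

```python
# Sketch of the described layout: small 5-minute live blocks accumulate
# under today/, and at midnight UTC they are replaced by one
# authoritative monthly file refetched from the source.
# (Illustrative only: dicts/lists stand in for a git repo and Parquet.)

def append_live_block(repo, stamp, items):
    """Commit a tiny 5-minute block under today/ (hypothetical helper)."""
    repo[f"today/{stamp}.parquet"] = items

def midnight_compact(repo, month, refetched_month_items):
    """Drop today/'s blocks and write one authoritative monthly file."""
    for path in [p for p in repo if p.startswith("today/")]:
        del repo[path]
    repo[f"{month}.parquet"] = refetched_month_items

repo = {}
append_live_block(repo, "2024-01-31T12-00", [{"id": 1}])
append_live_block(repo, "2024-01-31T12-05", [{"id": 2}])
midnight_compact(repo, "2024-01", [{"id": 1}, {"id": 2}])
print(list(repo))
# → ['2024-01.parquet']
```

Note that per the quoted description the monthly file is refetched from the source rather than merged from the live blocks, which is why the compaction step takes the refetched data as input.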
Parquet is a very efficient storage format. Data interfaces tend to treat directory paths as partitions, if the layout is logical.
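For example, the common hive-style convention encodes partition values directly in the path, and readers recover them by parsing `key=value` segments. A tiny stdlib sketch of that idea (the helper name is mine, not from any particular library):

```python
def hive_partitions(path):
    """Extract key=value partition segments from a hive-style path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

print(hive_partitions("year=2024/month=01/data.parquet"))
# → {'year': '2024', 'month': '01'}
```

Tools like PyArrow and Spark use this convention to prune entire directories from a scan when a query filters on a partition column.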
Was thinking the same thing. Probably once a day would be more than enough. If you really want minute-by-minute data, a delta file from the previous day should suffice.
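A daily delta could be as simple as keeping only the items that are new or changed since the previous snapshot. A hedged stdlib sketch, assuming each item is a record with an `id` field (the field name and function are mine):

```python
def daily_delta(prev_snapshot, curr_snapshot):
    """Return items from curr that are new or changed versus prev,
    keyed by an assumed 'id' field."""
    prev = {item["id"]: item for item in prev_snapshot}
    return [item for item in curr_snapshot if prev.get(item["id"]) != item]

yesterday = [{"id": 1, "score": 10}, {"id": 2, "score": 5}]
today = [{"id": 1, "score": 10}, {"id": 2, "score": 7}, {"id": 3, "score": 1}]
print(daily_delta(yesterday, today))
# → [{'id': 2, 'score': 7}, {'id': 3, 'score': 1}]
```

The delta file would then be much smaller than a full snapshot on most days, since only items whose fields changed (e.g. scores, comment counts) get re-emitted.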