My friend and I are usually pretty good at ballparking things of this nature, that is, "approximately how much textual data is github storing?" I immediately put an upper bound of a petabyte; there's absolutely no way github is storing a full petabyte of text.
Assuming just text, deduplication, and not being dumb about storage patterns, our range is 40-100TB, and even that is probably too high by 10x. 100TB would also mean the average repo is ~100KB.
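That mean-repo-size figure falls out of one division; a sanity-check sketch (the ~1 billion repo count is my assumption for round numbers, not a github-published figure):

```shell
# back-of-envelope: 100TB of text spread over ~1e9 repos (assumed count)
total_bytes=$((100 * 1000 ** 4))   # 100TB in decimal units
repos=$((1000 ** 3))               # ~1 billion repos (assumption)
mean=$((total_bytes / repos))
echo "$mean bytes per repo"        # 100000 bytes, i.e. ~100KB
```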
Nearly every arcade machine and pre-2002 console is available as a software "spin" that's <20TB.
How big was "every song on spotify"? 400TB?
the-eye archive is somewhere between a quarter and half a petabyte.
Wikipedia is ~100GB. It may be more now, i haven't checked. But the raw DB with everything you need to display the text of wikipedia is 50-100GB, and most of that is markup: not information for us, but information for the computer.
Common Crawl, with over 1.97 billion web pages in their archive: 345TB.
We do not believe this has anything to do with the "queries per second" or "writes per second" on the platform. Ballpark, github probably smooths out to around ten thousand queries per second, median. I'd have guessed less, but then again i once worked on a photography website whose database handled 4000 QPS all day long across two servers, 15 years ago.
P.S. just for fun i searched github for `#!/bin/bash` and it returned 15.3 million "code" results. Assume you replace just that 12-byte line with a 2-byte token: you save ~150MB on disk. That's compression; but how many files are duplicated? I don't mean forks with no activity, but copies across different projects? Also i don't care to discern the median bash-script byte length on github, but ballparked at a mean of 1000 chars/bytes, that's ~15GB on disk for just bash scripts :-)
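Both of those numbers are a single multiplication; a sketch using the search-result count above (the 1000-byte mean is a guess, as stated):

```shell
# 15.3M hits, each replacing the 12-byte "#!/bin/bash\n" line with a 2-byte token
hits=15300000
saved=$((hits * (12 - 2)))
echo "$saved bytes saved"          # 153000000, ~150MB
total=$((hits * 1000))             # assumed ~1000 bytes per script
echo "$total bytes of bash"        # ~15GB
```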
i have ~593 .sh files that everything.exe can see: 322 are 1KB or less, 100 are 1-2KB, 133 are 2-10KB, and the rest (38) are >10KB. of the sub-1KB ones, a random sample clusters around a mean of ~500B.
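For the curious, a rough find/awk equivalent of that everything.exe query for a Unix box (GNU find's `-printf` assumed; synthetic files here, since your tree will differ):

```shell
# bucket .sh files by size, like the everything.exe breakdown above
dir=$(mktemp -d)
head -c 500  /dev/zero > "$dir/a.sh"   # ~500B, the observed cluster mean
head -c 1500 /dev/zero > "$dir/b.sh"
head -c 5000 /dev/zero > "$dir/c.sh"
buckets=$(find "$dir" -name '*.sh' -printf '%s\n' |
  awk '{ if ($1<=1024) a++; else if ($1<=2048) b++; else if ($1<=10240) c++; else d++ }
       END { printf "<=1KB:%d 1-2KB:%d 2-10KB:%d >10KB:%d", a+0, b+0, c+0, d+0 }')
echo "$buckets"
rm -r "$dir"
```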
Veracity unconfirmed, but this article asserts that, until they did some cleanup, they were storing 19 petabytes.
https://newsletter.betterstack.com/p/how-github-reduced-repo...
maybe sourced from this tweet?
https://x.com/github/status/1569852682239623173
Edit: though maybe that data doesn't count as your "just text" data.
yeah i assume all the artifacts[0] and binaries greatly inflate that. I have no idea how git is implemented under the hood at github, so i can't comment on potential reasons there.
Is there some command a git administrator can issue to see granular statistics, or is "du -sh" the best we can get?
0: i'm assuming a site-rip that only fetches the equivalent of the "zip download" button for each repo; not the releases, wikis, images, workers, gists, etc.
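To partly answer the statistics question: `git count-objects -vH` gives loose/packed object counts and sizes, and newer git (>= 2.31) has `git rev-list --disk-usage` for on-disk bytes of reachable objects. A throwaway-repo sketch:

```shell
# demo per-repo object statistics on a scratch repo
repo=$(mktemp -d)
git -C "$repo" init -q
echo '#!/bin/bash' > "$repo/hello.sh"
git -C "$repo" add hello.sh
git -C "$repo" -c user.email=x@y -c user.name=x commit -qm init
stats=$(git -C "$repo" count-objects -v)   # loose/packed counts and sizes
echo "$stats"
# total on-disk bytes of all reachable objects (git >= 2.31):
git -C "$repo" rev-list --disk-usage --objects --all
rm -rf "$repo"
```

For real granularity (largest blobs, growth over time) third-party tools like `git-sizer` go further, but the above is plain git.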
I don't think the issue at hand is a technical challenge. It's merely a sign, imo, that usage has surged due to AI. To your point, this is a solvable scaling problem.
My worry is for the business and how they structure pricing. GitHub is able to provide the free services they do because at some point they did the math on what a typical free tier does before they grow into a paid user. They even did the math on what paid users do, so they know they'll still make money when charging whatever amount.
My hunch is AI is a multiplier on usage numbers, which increases OpEx, which means it's eating into GH's assumptions on margin. They will either need to accept a smaller margin, find other ways to shrink OpEx, or restructure their SKUs. The Spotifies and YouTubes of the world, hosting heavier media formats, have it harder than GitHub, but they can offset the cost of operation by running ads. Can you imagine having to watch a 20 second ad before you can push?
> Common Crawl, with over 1.97 billion web pages in their archive: 345TB.
Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is from just 1 of our 122 crawls.
oh, i didn't see that the 1.97 billion pages were crawled in an 11-day period earlier this month. either way, nearly 2,000,000,000 pages fit in roughly a third of a petabyte...
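In per-page terms that works out to (decimal units, my arithmetic; whether it counts as compressed depends on how the 345TB was measured):

```shell
# ~345TB over ~1.97B pages
pages=1970000000
bytes=$((345 * 1000 ** 4))
per_page=$((bytes / pages))
echo "$per_page bytes per page"    # ~175KB per page
```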
p.s. thanks for correcting me, i was using this information for something else, and now it's correct!