Already enough comments about base rate fallacy, so instead I'll say I'm worried for the future of GitHub.
Its business is underpinned by pre-AI assumptions about usage that, based on its recent instability, I suspect are being invalidated by surges in AI-produced code and commits.
I'm worried, at some point, they'll be forced to take an unpopular stance and either restrict free usage tiers or restrict AI somehow. I'm unsure how they'll evolve.
Having managed GitHub Enterprise for thousands of developers who will ping you at the first sign of instability... I can tell you there has not been one pre-AI year where GitHub was fully "stable" for a month, or maybe even a week, and except for that one time with CocoaPods, the downtime has always been of their own making.
In a (possibly near) future where most new code is generated by AI bots, the code itself becomes incidental/commoditized and is nothing more than an intermediate representation (IR) of whatever solution it was prompt-engineered to produce. The value will come from the proposals, reviews, and specifications that caused that code to be produced.
GitHub is still code-centric, with issues and discussions as auxiliary/supporting features around the code. At some point those will become the frontline features, and the code will become secondary.
I'm definitely not an AI skeptic and I use it constantly for coding, but I don't think we are approaching this future at all without a new technological revolution.
Specifications accurate enough to describe exact behavior are basically equivalent to code, including in length, so you are really just changing languages (and current LLM tech is not on course to handle specifications that large).
Higher-level specifications (the ones that make sense) leave some details and assumptions to the implementation, so you cannot safely ignore the implementation itself, and you cannot recreate it easily (each LLM build could change the details and the little assumptions).
So yeah: while I agree that documentation and specifications are more and more important in the AI world, I don't see the path to the conclusions you're drawing.
This is exactly what people said about the "low code revolution".
Not saying that you are wrong, necessarily. But I think it's still a pretty broad presumption.
I think you're directionally correct, but this stuff still has to live somewhere, whether the repo is code or prompts. GitHub is actually pretty well-positioned to evolve into whatever is next.
I don't think GitHub's product is at risk, but its business model might.
The instability is related to their Azure migration, isn't it? Cynically, you could say it hasn't been helped by the rolling RIFs at Microsoft.
I keep hearing this, and I know Azure has had some issues recently, but I rarely have an issue with Azure like I do with GitHub. I have close to 100 websites on Azure, running on .NET, mostly on Azure App Service (some on Windows 2016 VMs). These sites don't see the kind of traffic or number of features GitHub has, but if Azure is the issue, I wonder whether I just don't see it because not enough people depend on these sites compared to GitHub.
Or instead, is it mistakes being made migrating to Azure, rather than Azure being the actual problem? Changing providers can be difficult, especially if you relied on any proprietary services from the old provider.
Running on Azure is not the same as migrating to Azure.
Making big changes, like swapping the tech that underpins your product while still actively developing that product, means a lot of things in a complicated system changing at once, which is usually a recipe for problems.
Incidentally, I think that is part of the current problem with AI-generated code. It's a fire hose of changes into systems that were never designed for, or were barely holding together at, their existing rate of change. AI is able to produce perfectly acceptable code at times, but the churn is high, and the more code, the more churn.
> It's a fire hose of changes in systems that were never designed or barely holding together
Yeah... my career hasn't been that long, but I've only ever worked on one system that wasn't held together with duct tape, and a lot that were way more complicated than they needed to be.
Azure is fine, stability wise.
The assumption is that it would be mistakes in their migration: edge cases that have to be handled differently in the infrastructure code, config, or application services.
Does anyone actually know? So far I've just seen people guessing, and seeing that repeated.
I don't believe that a sudden influx of a few million bots running 24/7, generating PRs and commits and invoking Actions, doesn't impact GitHub.
It even sounds silly when you say it that way.
That is fair; in fact, I just came across their recent blog post on this. They're pointing to usage growth as the issue: https://github.blog/news-insights/company-news/addressing-gi...
Text is cheap to store, and not a lot of people in the world write code. Compare it to, for example, email or something like iCloud.
Also, I would guess there are copy-on-write and other such optimizations at GitHub. It's unlikely that when you fork a repo, the entire .git directory gets copied somewhere on disk (but even if it did, it's not that expensive).
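GitHub's actual fork storage isn't public, so this is only a sketch of the mechanism stock git already ships for exactly this kind of sharing: a clone can borrow objects from another repo through an "alternates" file instead of copying them.

```shell
# Sketch only: plain git's built-in object sharing (alternates).
# This does NOT show what GitHub actually does internally; it just
# demonstrates that forking need not duplicate the whole .git.
set -e
cd "$(mktemp -d)"

# make a small "upstream" repo with one commit
git init -q upstream
cd upstream
echo hello > file
git add file
git -c user.email=a@example.com -c user.name=a commit -qm init
cd ..

# --shared: the new clone references upstream's objects instead of copying them
git clone -q --shared upstream fork

# the fork records where to find the borrowed objects
cat fork/.git/objects/info/alternates   # path ending in upstream/.git/objects
```

So even with vanilla git, a "fork" can be little more than refs plus a pointer at someone else's object store; a hosting service can layer its own dedup on top of that.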
That doesn’t make sense. Commits are all text. If YouTube can easily handle 4PB of uploads a day with essentially one large data center that can handle that much daily traffic for the next 20 years, GitHub should have no problems whatsoever.
My friend and I are usually pretty good at ballparking things of this nature, namely "approximately how much textual data is GitHub storing?", and I immediately put an upper bound of a petabyte; there's absolutely no way GitHub has a petabyte of text.
Assuming just text, deduplication, and not being dumb about storage patterns, our range is 40-100TB, and that's probably too high by 10x. 100TB would mean the average repo is 100KB, too.
Nearly every arcade machine and pre-2002 console is available as a software "spin" that's <20TB.
How big was "every song on spotify"? 400TB?
The Eye is somewhere between a quarter and half a petabyte.
Wikipedia is ~100GB. It may be more now; I haven't checked. But the raw DB with everything you need to display the text contained in Wikipedia is 50-100GB, and most of that is the markup (that is, not information for us, but information for the computer).
Common Crawl, with 1.97 billion web pages in their archive: 345TB.
We do not believe this has anything to do with the "queries per second" or "writes per second" on the platform. Ballpark, GitHub probably smooths out to around ten thousand queries per second, median. I'd have guessed less, but then again I worked on a photography website's database one time that was handling 4000 QPS all day long between two servers, 15 years ago.
P.S. Just for fun I searched GitHub for `#!/bin/bash` and it returned 15.3 million "code" results. Assume you replace just that 12-byte string with 2 bytes: you save ~153MB on disk. That's compression; but how many files are duplicated? I don't mean forks with no activity, but different projects. Also, I don't care to discern the median bash-script byte length on GitHub, but ballparked at 1000 chars/bytes, mean, that's ~15GB on disk for just bash scripts :-)
I have ~593 .sh files that everything.exe can see, and 322 are 1KB or less, 100 are 1-2KB, 133 are 2-10KB, and the rest (38) are >11KB. Of the 1KB ones, a random sample shows they're clustering such that the mean is ~500B.
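The ballpark above is easy to sanity-check in the shell. All the inputs here are the thread's own assumptions (15.3M search hits, a 12-byte shebang compressed to 2 bytes, ~1KB mean script), not measured GitHub data:

```shell
# Back-of-envelope check of the bash-script numbers above.
hits=15300000                         # ~15.3M results for '#!/bin/bash'

saved=$(( hits * (12 - 2) ))          # 12-byte shebang -> 2-byte token
echo "shebang savings: $(( saved / 1000000 )) MB"             # 153 MB

mean=1000                             # assumed ~1KB mean script size
echo "bash scripts total: $(( hits * mean / 1000000000 )) GB" # ~15 GB
```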
Veracity unconfirmed, but this article asserts that until they did some cleanup they were storing 19 petabytes.
https://newsletter.betterstack.com/p/how-github-reduced-repo...
maybe sourced from this tweet?
https://x.com/github/status/1569852682239623173
Edit: though maybe that data doesn't count as your "just text" data.
Yeah, I assume all the artifacts[0] and binaries greatly inflate that. I have no idea how git works under the hood as it is implemented at GitHub, so I can't comment on potential reasons there.
Is there some command a git administrator can issue to see granular statistics, or is "du -sh" the best we can get?
0: I'm assuming a site-rip that only fetches the equivalent files to when you click the "zip download" button; not the releases, wikis, images, workers, gists, etc.
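On the granular-statistics question: for a local or self-hosted repo (GitHub's server side isn't queryable this way), plain git does ship something finer-grained than `du -sh`:

```shell
# Object-level storage stats for a local repo via git itself.
# (Demo repo created here just so the command has something to report.)
set -e
cd "$(mktemp -d)"
git init -q demo
cd demo
echo data > f
git add f
git -c user.email=a@example.com -c user.name=a commit -qm init

# loose vs. packed object counts, sizes, and garbage, in human units
git count-objects -v -H
```

That still won't separate "source text" from vendored binaries, but it's a start before reaching for `du`.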
I don't think the issue at hand is a technical challenge. It's merely a sign, imo, that usage has surged due to AI. To your point, this is a solvable scaling problem.
My worry is for the business and how they structure pricing. GitHub is able to provide the free services it does because at some point they did the math on what a typical free-tier user does before growing into a paid user. They even did the math on what paid users do, so they know they'll still make money when charging whatever amount.
My hunch is that AI is a multiplier on usage numbers, which increases OpEx, which means it's eating into GitHub's assumptions on margin. They will either need to accept a smaller margin, find other ways to shrink OpEx, or restructure their SKUs. The Spotifys and YouTubes of the world, hosting heavier media formats, have it harder, but they can offset the cost of operation by running ads. Can you imagine having to watch a 20-second ad before you can push?
> Common Crawl, with 1.97 billion web pages in their archive: 345TB.
Common Crawl is 300 billion web pages and 10 petabytes. I suppose your number is one of our 122 crawls.
Oh, I didn't see that the 1.97 billion pages were crawled in an 11-day period earlier this month. Either way, nearly 2,000,000,000 pages fit in about a third of a petabyte...
P.S. Thanks for correcting me; I was using this information for something else, and now it's correct!
I think the instability is mostly due to the CEO leaving at the same time as a forced Azure migration, in the middle of which the VP of engineering also left. There's only so much stability you can expect from a ship that's missing two captains.
I mean, the fish rots from the head, but at the end of the day that rot translates into an engineering culture that doesn't value craftsmanship and quality. Every GitHub product I've used reeks of sloppiness and poor architecture.
That's not to say they don't have people who can build good things; they built the standard for code distribution, after all. But you can't help but recognize that so much of it is duct-taped together to ship, instead of crafted and architected with intent behind the major decisions that would allow the small shit to just work. If you've ever worked on a similar project that evolved that way, you know the feeling.
This.
But also, GitHub profiles and repos were at one point a window into specific developers - like a social site for coders. Now it's suffering from the same problem that social media sites suffer from - AI-slop and unreliable signals about developers. Maybe that doesn't matter so much if writing code isn't as valuable anymore.
After Microsoft acquired it, they greatly expanded the free-tier allowances, and they still seem happy to dump money into it.
Counterpoint: AI coding without GitHub is like performing a stunt where you set yourself on fire, but without a fire crew to extinguish the flames.
> worried for the future of GitHub
Oh no, who would think about the big corporations? How is Micro$lop going to survive? /s
Fuck GitHub. It's a corporate attempt at owning git by sprinkling socials on top. I hope it fails.
If you need to host git plus a nice GUI (as opposed to needing to promote your shit), Forgejo is free software.
The true value prop of GitHub isn't "hosted git + nice GUI"; it's the whole ecosystem of contributors, forks, and PRs. You don't get that by hosting your own forge.
Also, I wouldn't say GitHub is a corporate attempt to own git... GitHub is a huge part of why Git is as popular as it is these days, and GitHub started as a small startup.
Of course, you can absolutely say Microsoft bought GitHub in an attempt to own git, but I think you are really underselling the value of the community parts of GitHub.
Or they'll just keep forcing policies that let them steal the code you post on GitHub (for their AI training), and make everyone leave that way.