At least as of when I left the company, GitHub was being deployed roughly once every 60-90 minutes (the frequency of a deploy train/merge queue batch going out), 24 hours a day, at least during weekdays… there are a fair number of international engineers, and while deploy trains get crowded during main US business hours and fewer PRs go out at odd hours US time, there were typically still some. There aren't dedicated releases as such for GitHub-hosted instances: everything you release needs to be gated behind a feature flag or other mechanism if it isn't going live immediately, and your code either needs to handle the database in both its pre- and post-migrated state, or you need to run the migration in advance of your code shipping out.
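A minimal sketch of what those two constraints look like in code, under stated assumptions: the names here (`feature_enabled`, a `username` → `login` column rename) are hypothetical, and it's written in Python for illustration rather than the Ruby on Rails monolith GitHub actually runs.

```python
# Sketch: code that is safe to deploy before its feature is "released",
# and before OR after a column-rename migration has run.
# All names are hypothetical illustrations, not GitHub's real code.

def feature_enabled(flag: str, actor_id: int) -> bool:
    """Stand-in for a feature-flag check (hypothetical).

    Real systems consult a flag store (percentage rollout, allowlist,
    etc.) so features turn on independently of the deploy that shipped
    the code.
    """
    enabled_actors = {42}  # imagine this set comes from a flag store
    return actor_id in enabled_actors


def render_profile(user: dict, actor_id: int) -> str:
    # 1) Feature-flag gating: the new behavior ships dark and only
    #    renders for flagged actors, so deploying this code changes
    #    nothing on its own.
    if feature_enabled("new_profile_page", actor_id):
        header = "== New Profile =="
    else:
        header = "== Profile =="

    # 2) Migration tolerance: suppose a migration renames the
    #    `username` column to `login`. This read works whether or not
    #    that migration has run yet, because deploys and migrations
    #    are decoupled and can land in either order.
    login = user.get("login", user.get("username", "<unknown>"))
    return f"{header}\n{login}"


if __name__ == "__main__":
    pre_migration_row = {"username": "octocat"}   # row shape before the rename
    post_migration_row = {"login": "octocat"}     # row shape after the rename
    print(render_profile(pre_migration_row, actor_id=42))
    print(render_profile(post_migration_row, actor_id=7))
```

The point of the dual-read in step 2 is that the deploy and the migration don't need a coordinated release; either can go first and the app keeps working in between.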
Fun fact: it used to be the case that GitHub was actually _less_ reliable if nobody deployed to it… there used to be various resource leaks that we never noticed while people were deploying all day, since the app was getting restarted constantly. After GitHub went down during a holiday break, we had volunteers deploy GitHub once a day during holiday breaks, until the underlying leaks were eventually fixed.
Would love to know more about how they deployed their monolith, if you have anything to share or any public links about it.
Well, outages seem to be distributed across all days except weekends, so this suggests that people fucking around with stuff is a major factor.
Surely it just means more people working, resulting in more load, resulting in more outages?
Or even both. In any kind of continuous deployment, you'd expect outages at the point of deployment, or shortly thereafter as the unintended consequences ripple.
Then the load during working days amplifies those ripples into outages.
Most outages are caused by changes made by humans ("actors"?). Very rarely is it "people just dig our stuff so much we can't keep up"; more often it's "we didn't think about this performance drawback when we built thing X, and now it's hurting us". And of course, there are more outages when you try to fix those issues without fully considering the scope and impact.