It's entirely possible the move to Azure has made the availability problems worse. Dedicated hardware is much more predictable than cloud. "Let's not move to Azure and instead buy a few more racks" was likely a decision beyond the pay grade of GitHub's management.
Moving to cloud makes scaling much easier and faster than colo data centers, though it costs more and might not be as reliable.
Maybe, but on the other hand, modern hardware is fantastically powerful, so you might not need to scale, and GitHub likely has an even and predictable usage pattern that allows them to plan expansion.
Azure is easily the least reliable and least secure of the 3 hyperscalers, which is crazy because GCP was an also-ran underdog not that long ago.
This entire exercise is, if anything, a huge indictment of Azure.
But that doesn't matter because the kind of person that buys Azure, just like the kind of person that buys MS Teams, is entirely driven by price and does not care about anything else.
> entirely driven by price
I might buy that argument if Azure compensated for its awful availability and security with lower prices.
But the kind of person who buys Azure is the kind of person who buys Windows and Teams, perfectly happy to pay a premium for all the extra abuse.
It's curious how bad people say Azure is. I've never used it, but I've used AWS, and AWS is a gigantic mess. So it concerns me if Azure is worse than a gigantic mess.
Azure's management APIs break connections coming from outside Azure's network every time they use DNS to execute a blue/green swap on their public load balancers. Existing connections are not gracefully drained. Terraform state gets corrupted (it thinks the operation failed when it actually succeeded and the resource was actually created) and requires manual fixing.
This happened frequently enough, at large enough scale, that we seriously considered building automation to scan the Terraform logs for the broken connection and automatically import the created resource.
Azure support was completely worthless.
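For what it's worth, the automation described above might look roughly like the sketch below. Everything in it is an assumption for illustration: the error strings, the shape of the log line that names the failed resource, and the `resource_ids` mapping (address to Azure resource ID, which you would have to obtain some other way). The only real piece is the `terraform import ADDRESS ID` command it prepares. It only builds the commands and does not run them.

```python
import re

# Assumed markers of a dropped connection; the real error text may differ.
CONNECTION_ERRORS = ("connection reset by peer", "unexpected EOF")
# Assumed shape of the log line that names the failed resource address.
ADDRESS_RE = re.compile(r'\bwith ([\w.\[\]"-]+),')


def plan_imports(log_text, resource_ids):
    """Return `terraform import` argv lists for resources whose apply
    failed with a connection error.

    resource_ids is a hypothetical mapping of resource address to the ID
    of the resource that was actually created.
    """
    commands = []
    pending = False  # True while inside an error block that looks like a dropped connection
    for line in log_text.splitlines():
        if line.lstrip("│ ").startswith("Error:"):
            pending = any(marker in line for marker in CONNECTION_ERRORS)
            continue
        match = ADDRESS_RE.search(line)
        if pending and match:
            address = match.group(1)
            if address in resource_ids:
                commands.append(
                    ["terraform", "import", address, resource_ids[address]]
                )
            pending = False
    return commands
```

You would still want a human (or at least a `terraform plan`) to confirm each import, since a connection error can also mean the resource really was not created.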
Azure is worse. This series of posts was shared here not that long ago: https://isolveproblems.substack.com/p/how-microsoft-vaporize...
AWS is a complex mess, but it’s pretty good at delivering its services reliably. Azure is a mess that is also unreliable.
I mean, it's Microsoft and it's Azure. How much can go wrong when you click together a few (or a few hundred) normal, non-autoscaling VMs?
There is so much workload running on Azure, and I've never heard of VMs going away.
If Microsoft can source hardware for Azure, Microsoft can source hardware for GitHub.
There's a lot that can go wrong with a hypervisor, including hiding hardware issues from the guest OS.
We don't think about it because we've been quite spoiled with excellent virtual machine platforms (KVM, Xen and even VMware).
Those who have worked a lot with VirtualBox will be aware of this. After you've spent enough time with VirtualBox, it can be deeply unnerving that VM technology is the default way to deploy things. (VirtualBox is very good for its original purpose, but not for reliability.)
The question is: Does Azure use something more like VirtualBox, or more like KVM?
Hyper-V exhibits properties closer to VirtualBox.
Hyper-V looks like VirtualBox but it's not. It's type 1, like KVM is.
I meant in terms of bubbling up hardware issues.
I've had Windows Server VMs soft crash and hard crash on Azure. Some soft-lock, and a restart via Azure gets them back. Sometimes the only fix has been to power off / deprovision and then power on again (i.e. a restart didn't fix it). It's not common, but I've encountered it multiple times. These were VMs whose operating systems were created in Azure from their images.