What puzzles me too is how long it took to recognize the outage.

Looks like there was no monitoring and no alerts.

Which is kinda weird.

I've seen alert sensitivity get tuned down to avoid false positives during deployments or rolling restarts for host updates, and to a lesser extent to cope with autoscaling noise. It can be hard to get right.

I think it's partly a gap in the tooling. We apply the same alert criteria at 2 am that we do while someone is actively running deployment or admin tasks. A subset of those should stay the same, like request failure rate, while others should be relaxed during that work, like overall error rate and median response time.
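Something like this rough sketch is what I have in mind; the names and numbers are made up, not from any real monitoring tool:

```python
# Hypothetical sketch of deployment-aware alert thresholds -- names and
# numbers are illustrative, not from any real monitoring system.

from dataclasses import dataclass


@dataclass
class Thresholds:
    request_failure_rate: float  # hard failures: should alert no matter what
    overall_error_rate: float    # gets noisy during deploys/restarts
    median_response_ms: float    # ditto (cold caches, draining hosts)


NORMAL = Thresholds(
    request_failure_rate=0.01,
    overall_error_rate=0.02,
    median_response_ms=250,
)


def active_thresholds(deploy_in_progress: bool) -> Thresholds:
    """Keep the hard-failure alert fixed; relax only the noisy ones during a deploy."""
    if not deploy_in_progress:
        return NORMAL
    return Thresholds(
        request_failure_rate=NORMAL.request_failure_rate,  # same at 2 am or mid-deploy
        overall_error_rate=NORMAL.overall_error_rate * 3,  # tolerate restart noise
        median_response_ms=NORMAL.median_response_ms * 2,  # tolerate warm-up latency
    )
```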

And a 90% failure rate on one machine means one thing while a 5% failure rate across the cluster means something else, but if you've only got 18 boxes the two are hard to tell apart: one dead box out of 18 shows up as roughly 5% of overall traffic failing. And which of those is the higher-priority alert may change from one project to another.
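Rough arithmetic on why the two cases blur together at the cluster level, with illustrative numbers and assuming traffic is spread evenly across hosts:

```python
# Illustrative only: one 90%-broken box out of 18 aggregates to the same
# cluster-wide rate as a diffuse 5% failure, assuming even traffic.

def cluster_failure_rate(per_host_rates: list[float]) -> float:
    """Cluster-wide failure rate when each host gets an equal share of traffic."""
    return sum(per_host_rates) / len(per_host_rates)


hosts = 18
one_bad_box = [0.90] + [0.0] * (hosts - 1)  # a single host is 90% broken
diffuse = [0.05] * hosts                    # every host fails 5% of requests

print(f"{cluster_failure_rate(one_bad_box):.1%}")  # 5.0%
print(f"{cluster_failure_rate(diffuse):.1%}")      # 5.0%

# Both look like "5% cluster failure rate" unless the alert also breaks the
# rate down per host, yet the right response to each is very different.
```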

Just what you want in a cloud provider, right?