It's the length of the outage that's striking. AWS us-east-1 has had a few serious outages in the last ~decade, but IIRC none took anywhere near 14 hours to resolve.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
Couldn’t this be explained by natural growth of the amount of cloud resources/data under management?
The more you have, the faster the backlog grows during an outage, and the longer it takes to work through it all once the system comes back online.
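A rough back-of-envelope sketch of that argument (all rates and numbers here are made up for illustration, not AWS figures): if work keeps arriving at a rate proportional to the resources under management, the backlog at restore time scales with fleet size, and recovery lasts until drain capacity catches up.

```python
# Hypothetical backlog/recovery model: backlog accumulated during an outage is
# arrival_rate * outage_duration, and it drains at (drain_rate - arrival_rate).

def recovery_hours(arrival_per_hour: float, drain_per_hour: float, outage_hours: float) -> float:
    """Hours needed to clear the backlog built up during an outage."""
    if drain_per_hour <= arrival_per_hour:
        return float("inf")  # the system never catches up
    backlog = arrival_per_hour * outage_hours
    return backlog / (drain_per_hour - arrival_per_hour)

# Doubling the workload without doubling drain capacity doubles the recovery
# time for the same-length outage (made-up numbers):
print(recovery_hours(1_000, 1_500, 3))  # 6.0 hours
print(recovery_hours(2_000, 2_500, 3))  # 12.0 hours
```

So even with the same root cause and the same time-to-fix, a bigger fleet can stretch the visible recovery tail considerably.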
Not really. The issue was the time it took to correctly diagnose the problem, plus the cascading failures that followed, each triggering more lengthy troubleshooting. Rightly or wrongly, it plays into the "the folks who knew best how all this works have left the building" vibe. Folks inside AWS say that's not entirely inaccurate.