
The droplet manager failure is a much more forgivable scenario. It happened because the "must always be up" service went down for an extended period of time, and the sheer number of actions needed for the recovery overwhelmed the system.

The initial DynamoDB DNS outage was much worse. A bog-standard TOCTTOU (time-of-check to time-of-use) bug in scheduled tasks that are assumed to be "instant". And there were no controls to stop one task from blowing up everything in one of the foundational services.
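For anyone who hasn't run into this class of bug, here's a minimal sketch of a TOCTTOU in a "check, then apply later" task. It is not a reconstruction of the actual AWS system; `PlanStore`, `apply_unsafe`, `apply_checked`, and the generation numbers are all made up for illustration. The point is only that a check done earlier says nothing about the state at the moment of the write, unless the check and the write happen together.

```python
import threading


class PlanStore:
    """Hypothetical store tracking which plan generation is currently applied."""

    def __init__(self):
        self._lock = threading.Lock()
        self.applied_generation = 0

    def apply_unsafe(self, generation, checked_ok):
        # Trusts a check that was made earlier. The state may have
        # changed between that check and this write (the TOCTTOU window).
        if checked_ok:
            self.applied_generation = generation

    def apply_checked(self, generation):
        # Re-validates at the time of use: check and write are atomic.
        with self._lock:
            if generation <= self.applied_generation:
                return False  # stale plan, refuse to apply
            self.applied_generation = generation
            return True


def demo():
    store = PlanStore()
    # A slow worker checks that plan 1 is newer than what is applied (0).
    stale_check = 1 > store.applied_generation
    # Meanwhile a faster worker applies plan 2.
    store.apply_checked(2)
    # The slow worker finally acts on its old check and clobbers plan 2.
    store.apply_unsafe(1, stale_check)
    unsafe_result = store.applied_generation

    # Same ordering, but the slow worker re-checks when it writes.
    safe_store = PlanStore()
    safe_store.apply_checked(2)
    accepted = safe_store.apply_checked(1)
    return unsafe_result, accepted, safe_store.applied_generation
```

In the unsafe path the older plan wins simply because it arrived last. In the checked path the stale plan is rejected no matter how late the task runs, which is the property you want when a task is assumed to be "instant" but sometimes isn't.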

When I was at AWS some years ago, there were calls to limit the blast radius by using a cell-based architecture to create vertical slices of the infrastructure for critical services. I guess that got completely sidelined.
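Roughly what I mean by cells, as a toy sketch: each customer is pinned to exactly one independent slice, so a bad change in one slice only hits the customers mapped to it. The function names and the hash-based assignment are my own assumptions for illustration, not how any AWS service actually routes traffic.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Deterministically map a customer to one cell (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell: int, num_cells: int):
    """Return only the customers whose cell has failed."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]
```

With N cells and an even spread, a failure confined to one cell touches about 1/N of the customers instead of all of them. The hard part in practice is keeping the cells truly independent, so that nothing shared (like a single DNS record) can take all of them down at once.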