I think you have to be careful with ideas like "the root cause". They underwent a metastable congestive collapse. A large component of the outage was that they had no runbook for safely restoring their droplet manager service to an adequately performing state.

The precipitating event was a race condition in the DynamoDB planner/enactor system.

https://how.complexsystems.fail/

Why can't a race condition bug be seen as the single root cause? Yes, there were other factors that accelerated collapse, but those are inherent to DNS, which is outside the scope of a summary.

Because the DNS race condition is just one flaw in the system. The more important latent flaw† is probably the metastable failure mode of the droplet manager: when it loses connectivity to Dynamo, it gradually loses connectivity with the Droplets themselves, until a critical mass is hit and the droplet manager has to be throttled and manually recovered.

Importantly: the DNS problem was resolved (to a degraded state) in 1h15m and fully resolved in 2h30m. The droplet manager problem took much longer!
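
To make that failure mode concrete, here is a toy sketch (all numbers and names are invented for illustration; this is nothing like AWS's actual code): recovery capacity per tick is bounded, while re-establishing leases generates further expiries in proportion to the backlog. Below a critical backlog the system drains back to health on its own; above it, it never catches up without throttling.

    # Toy model of a metastable congestive collapse. All parameters are
    # made up for illustration; they are not AWS's real numbers.
    def simulate(initial_expired, total_leases=100_000,
                 recovery_per_tick=500, expiry_rate=0.02, ticks=200):
        backlog = initial_expired
        for _ in range(ticks):
            recovered = min(backlog, recovery_per_tick)
            # Churn from re-establishment (timeouts, retries) expires more
            # leases in proportion to the outstanding backlog.
            newly_expired = min(total_leases - backlog, int(backlog * expiry_rate))
            backlog = backlog - recovered + newly_expired
            if backlog == 0:
                return "recovered on its own"
        return f"stuck with backlog {backlog}"

    print(simulate(initial_expired=10_000))  # below critical mass: drains
    print(simulate(initial_expired=40_000))  # above critical mass: never drains

The same system has two stable regimes; the trigger only selects which one you land in, and climbing out of the bad one is what takes so long.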

This is the point of complex failure analysis, and why that school of thought says "root causing" is counterproductive. There will always be other precipitating events!

And that precipitating event could itself very well be a second-order effect of some even deeper, more latent issue that would be more useful to address!

The droplet manager failure is a much more forgivable scenario. It happened because the "must always be up" service went down for an extended period of time, and the sheer number of actions needed for recovery overwhelmed the system.

The initial DynamoDB DNS outage was much worse: a bog-standard TOCTTOU bug in scheduled tasks that are assumed to be "instant", plus a lack of controls that allowed one task to blow up everything in one of the foundational services.
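
For a picture of the bug class rather than the specific AWS internals, here is a minimal check-then-act sketch (the names enactor_unsafe, enactor_safe, and the registry dict are hypothetical, not AWS's planner/enactor code):

    import threading
    import time

    registry = {"applied_version": 0, "record": "initial"}
    lock = threading.Lock()

    def enactor_unsafe(version, record, delay):
        # CHECK: the plan looks newer than what is currently applied...
        if version > registry["applied_version"]:
            time.sleep(delay)  # ...but the scheduled task is not "instant".
            # ACT: by now a newer plan may already be live, and this write
            # silently rolls the record back to a stale state.
            registry["applied_version"] = version
            registry["record"] = record

    def enactor_safe(version, record, delay):
        time.sleep(delay)
        with lock:  # re-validate and act atomically at the moment of writing
            if version > registry["applied_version"]:
                registry["applied_version"] = version
                registry["record"] = record

    # A slow enactor holding an old plan races a fast one holding a new plan.
    slow = threading.Thread(target=enactor_unsafe, args=(1, "stale", 0.2))
    fast = threading.Thread(target=enactor_unsafe, args=(2, "fresh", 0.0))
    slow.start(); fast.start(); slow.join(); fast.join()
    print(registry)  # typically {'applied_version': 1, 'record': 'stale'}

The unsafe version checks the condition and acts on it later; the safe one re-checks when it actually writes. A real DNS control plane would presumably use versioned or conditional writes rather than an in-process lock, but the shape of the hazard is the same.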

When I was at AWS some years ago, there were calls to limit the blast radius by using cell architecture to create vertical slices of the infrastructure for critical services. I guess that got completely sidelined.

Two different questions here.

1. How did it break?

2. Why did it collapse?

A1: Race condition

A2: What you said.

What is the purpose of identifying "root causes" in this model? Is the root cause of a memory corruption vulnerability holding a stale pointer to a freed value, or is it the lack of memory safety? Where does AWS gain more advantage: in identifying and mitigating metastable failure modes in EC2, or in trying to identify every possible way DNS might take down DynamoDB? (The latter is actually not an easy question, but that's the point!)

Two things can be important, depending on the audience. For most readers, it's the race condition lesson: locks are there for a reason. For AWS, it's the stability lesson: DNS can and did take down the empire for several hours.

Did DNS take it down, or did a pattern of latent failures take it down? DNS was restored fairly quickly!

Nobody is saying that locks aren't interesting or important.

The Droplet lease timeouts were an aggravating factor for the severity of the incident, but they are not causative. Absent a trigger, the droplet leases never experience congestive failure.

The race condition was necessary and sufficient for collapse. Absent corrective action, it always leads to AWS going down. With corrective action, and without other aggravating factors, the severity of the failure would have been minor, but the race condition is always the cause of this failure.

This doesn't really matter. This type of error gets the full five-whys treatment, and every "why" needs to be fixed. Both problems will certainly have an action item.

It is not my claim that AWS is going to handle this badly, only that this thread is.