What is the purpose of identifying "root causes" in this model? Is the root cause of a memory corruption vulnerability holding a stale pointer to a freed value, or is it the lack of memory safety? Where does AWS gain more advantage: in identifying and mitigating metastable failure modes in EC2, or in trying to identify every possible way DNS might take down DynamoDB? (The latter is actually not an easy question, but that's the point!)
Two things can be important for an audience. For most, it's the race condition lesson. Locks are there for a reason. For AWS, it's the stability lesson. DNS can and did take down the empire for several hours.
Did DNS take it down, or did a pattern of latent failures take it down? DNS was restored fairly quickly!
Nobody is saying that locks aren't interesting or important.
The Droplet lease timeouts were an aggravating factor in the severity of the incident, but they were not causative. Absent a trigger, the Droplet leases never experience congestive failure.
The race condition was necessary and sufficient for collapse. Absent corrective action, it always leads to AWS going down. With corrective action and without the other aggravating factors, the failure would have been minor, but the race condition is always the cause of this failure.
This doesn’t really matter. This type of error gets the whole "5 Whys" treatment, and every why needs to get fixed. Both problems will certainly have an action item.
It is not my claim that AWS is going to handle this badly, only that this thread is.