How does knowing this help you avoid these problems? It doesn’t seem to provide any guidance on what to do in the face of complex systems.
He's literally writing about Three Mile Island. He doesn't have anything to tell you about what concurrency primitives to use for your distributed DNS management system.
But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?
Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.
It's entirely unclear to me whether an organization running a system the size and scope of AWS could rethink it using these principles and successfully execute a complete restructuring of all its processes, just to reduce the failure rate a bit. It's a system that grew over time, built by many thousands of different developers who needed to solve critical scaling issues that would have stopped the business in its tracks (far worse than this outage).
Another point is that DWFM is likely working in a privileged, isolated network because it needs access deep into the core control plane. After all, you don't want a rogue service to be able to add a malicious agent to a customer's VPC.
And since this network is privileged, observability tools, debugging support, and maybe even access itself are more complicated. Even just the set of engineers who have access is likely more limited, especially at 2AM.
Should AWS relax these controls to make recovery easier? That would also result in a less secure system. Again, it's a trade-off.
Both documents are "ceremonies for engineering personalities."
Even you can't help it - "enumerating a list of questions" is a very engineering thing to do.
Normal people don't talk or think like that. The way Cook is asking us to "think about problems" is kind of the opposite of what good leadership looks like. Thinking about thinking about problems is like, 200% wrong. On the contrary, be way more emotional and way simpler.
I don’t really follow what you are suggesting. If the system is complex and constantly evolving, as the article states, you aren’t going to be able to close any expertise/process gap. Operating in a degraded state is probably already built in; this was just a state of degradation they were not prepared for. You can’t figure out all the degraded states to operate in because, by definition, the system is complex.