I'm a tedious broken record about this (among many other things) but if you haven't read this Richard Cook piece, I strongly recommend you stop reading this postmortem and go read Cook's piece first. It won't take you long. It's the single best piece of writing about this topic I have ever read and I think the piece of technical writing that has done the most to change my thinking:

https://how.complexsystems.fail/

You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)

That minimalist public postmortem describes what sounds like a Rube Goldberg machine, and the reality is probably even hairier. I completely agree that if one wants to understand "root causes", it's more important to understand why such machines are built/trusted/evolved in the first place.

That piece by Cook is ok, but it's largely just a list of assertions (true or not, most do feel intuitive, though). I suppose one should delve into all those references at the end for details? Anyway, this is an ancient topic, and I doubt we have all the answers on those root whys. The MIT course on systems, 6.033, used to assign a paper that has come up on HN only a few times in its history: https://news.ycombinator.com/item?id=10082625 and https://news.ycombinator.com/item?id=16392223 It's from 1962, over 60 years ago, but it is also probably more illuminating/thought-provoking than the post mortem. Personally, I suspect this is an instance of a https://en.wikipedia.org/wiki/Wicked_problem , but only past a certain scale.

I have a housing activism meetup I have to get to, but real quick let me just say that these kinds of problems are not an abstraction to me in my day job, that I read this piece before I worked where I do and it bounced off me, but then I read it last year and was like "are you me but just smarter?", like my pupils probably dilated theatrically when I read it like I was a character in Requiem for a Dream, and I think most of the points he's making are much subtler and deeper than they seem at a casual read.

You might have to bring personal trauma to this piece to get the full effect.

Oh, it's fine. At your leisure. I didn't mean to go against the assertions themselves, but more just to speak to their "unargued" quality and often sketchy presentation. Even that Simon piece has a lot of this in there, where it's sort of "by definition of 'complexity'/by unelaborated observation".

In engineered systems, there is just a disconnect between the KISS we practice on our own, at small scale, and what happens in large organizations, and then over time. This is the real root cause/why, but I'm not sure it's fixable. Maybe partly addressable, tho'.

One thing that might give you a moment of worry: both in that Simon piece and, far more broadly, all over academia long before and ever since, biological systems like our bodies are an archetypal example of "complex". Besides medical failures, life mostly has this one main trick -- make many copies, and if they don't all fail before they, too, can copy, then a stable-ish pattern emerges.

Stable populations + "litter size/replication factor" largely imply average failure rates. For most species it is horrific. On the David Attenborough specials they'll play the sad music and tell you X% of these offspring never make it to mating age. The alternative is not the https://en.wikipedia.org/wiki/Gray_goo apocalypse, but the "whatever-that-species-is-biopocalypse". Sorry - it's late and my joke circuits are maybe fritzing. So, both big 'L' and little 'l' life, too, "is on the edge", just structurally.

https://en.wikipedia.org/wiki/Self-organized_criticality (with sand piles and whatnot) used to be a kind of statistical-physics hope for a theory of everything for these kinds of phenomena, but it just doesn't get deployed. Things will seem "shallowly critical" but not so upon deeper inspection. So, maybe it's just not a useful enough approximation.

Anyway, good luck with your housing meetup!

As a contractor who is on an oncall schedule: I have never worked at a company that treats oncall as very serious business. I have only worked at 2 companies that need oncall, so I’m biased. On paper, they both say it is serious and all the SLA stuff was set up, but in reality there is not enough support.

The problem is, oncall is a full-time business. It takes the full attention of the oncall engineer, whether there is an issue or not. Both companies simply treated oncall as a by-product: we just had to do it, so let’s stuff it into the sprint. The first company was slightly more serious, as we were asked to put up a 2-3 point oncall task in JIRA. The second one didn’t even do that.

Neither company really encouraged engineers to read through complex code written by others, even if we did oncall for those products. Again, the first company did better: we were supposed to create a channel and pull people in, so it was OKish to not know anything about the code. The second company simply left oncall to do whatever they could. Neither company allocated enough time for engineers to read the source code thoroughly, and neither had good documentation for oncall.

I don’t know the culture of AWS. I’d very much like to work in an oncall environment that is serious and encourages learning.

When I was an SRE at Google, our oncall was extremely serious (if the service went down, Google was unable to show ads, record ad impressions, or do any billing for ads). It was done on a rotation and lasted 1 week (IIRC it was 9AM-9PM; another time zone covered the alternate 12 hours). The on-call was empowered to do pretty much anything required to keep the service up and running, including cancelling scheduled downtimes, pausing deployment updates, stopping abusive jobs, stopping abusive developers, and invoking an SVP if there was a fight with another important group.

We sent a test page periodically to make sure the pager actually beeped. We got paid extra for being in the rotation. The leadership knew this was a critical step. Unfortunately, much of our tooling was terrible, which caused false pages or failed critical operations all too frequently.

I later worked on SWE teams that didn't take dev oncall very seriously. At my current job, we have an oncall, but it's best effort business hours only.

>empowered to do pretty much anything required to keep the service up and running,

Is that really uncommon? I've been on call for many companies and many types of institutions, and I can't recall ever being told I couldn't do something to bring a system up. It's kinda the job?

On-call seriousness should be directly proportional to pay. Google pays. If smallcorp wants to pay me COL, I'll be looking at that 2AM ticket at 9AM when I get to work.

That’s pretty good. Our oncall is actually 24-hour for one week. On paper it looks very serious, but even the best of us don’t really know everything, so issues tend to lag to the morning. Nor do we get any compensation for it. Someone has a bad night and still needs to log on the next day. There is an informal understanding to relax a bit if the night was too bad, though.

I did 24hr-for-a-week oncall for 10+ years, do not recommend.

12-12 rotation in SRE is a lot more reasonable for humans

Unfortunately, 24hr-for-a-week seems to be the default everywhere nowadays; it's just not practical for serious businesses. It's just an indicator of how important UPTIME is to a company.

I agree. It sucks. And our schedule is actually 2 weeks in every five. One is secondary and the other is primary.

Handling my first non-prod alert bug as the oncall at Google was pretty eye opening :)

It was a good lesson in what a manicured lower environment can do for you.

Amazon generally treats on call as a full-time job. Engineers who are on call are generally expected to only be on call. No feature work.

It's very team/org dependent and I would say that's generally not the case. In 6 years I have only had 1 team out of 3 where that was true. The other two teams I was expected to juggle feature work with oncall work. Same for most teams I interacted with.

Interesting, I've been here nearly that long, and on every team I've worked with it's generally the way I described. Do engineers always do that? No. But it is the expectation.

That's actually pretty good.

To quote Grandpa Simpson, "Everything everyone just said is either obvious or wrong".

Pointing out that "complex systems" have "layers of defense" is neither insightful nor useful, it's obvious. Saying that any and all failures in a given complex system lack a root cause is wrong.

Cook uses a lot of words to say not much at all. There's no concrete advice to be taken from How Complex Systems Fail, nothing to change. There's no casualty procedure or post-mortem investigation that would change a single letter of a single word in response to it. It's hot air.

There’s a difference between ‘grown organically’ and ‘designed to operate this way’, though. Experienced folks will design system components with conscious awareness of what operations actually look like from the start. Juniors won’t, and will keep bolting on quasi-solutions as their systems fall over time and time again. Cook’s generalization is actually wildly applicable, but it takes work to map it to specific situations.

Another great lens to see this is "Normal Accidents" theory, where the argument is made that the most dangerous systems are ones where components are very tightly coupled, interactions are complex and uncontrollable, and consequences of failure are serious.

https://en.wikipedia.org/wiki/Normal_Accidents

As I was reading through that list, I kept feeling, "why do I feel this is not universally true?"

Then I realized: the internet; the power grid (at least in most developed countries); there are things that don't actually fail catastrophically, even though they are extremely complex and not always built by efficient organizations. What's the retort to this argument?

They do fail catastrophically. E.g. https://en.wikipedia.org/wiki/Northeast_blackout_of_2003

I think you could argue AWS is more complex than the electrical grid, but even if it's not, the grid has had several decades to iron out kinks and AWS hasn't. AWS also adds a ton of completely new services each year, in addition to adding more capacity. E.g. I bet these DNS Enactors have become more numerous and their plans much larger than when they were first developed, which has greatly increased the odds of hitting this issue.
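
(To make that concrete, here's a minimal sketch of the stale-write race the postmortem describes and the kind of version check that guards against it. The names -- PlanStore, apply_plan -- are made up, and this makes no claim to reflect AWS's actual implementation; it just shows why an Enactor that finishes late shouldn't be able to overwrite a newer plan.)

  # Minimal sketch only: hypothetical names, not AWS's real code. A slow
  # "Enactor" finishing late could overwrite a newer plan unless the apply
  # step refuses to go backwards in plan versions.
  import threading

  class PlanStore:
      """Stand-in for the shared DNS record set an Enactor writes to."""
      def __init__(self):
          self._lock = threading.Lock()
          self.applied_version = 0
          self.records = {}

      def apply_plan(self, version, records):
          # Conditional apply: only move forward. Without this check,
          # whichever Enactor finishes last wins, even with a stale plan.
          with self._lock:
              if version <= self.applied_version:
                  return False  # stale plan, drop it
              self.applied_version = version
              self.records = records
              return True

  store = PlanStore()
  store.apply_plan(42, {"endpoint": ["10.0.0.1"]})                  # newer plan lands
  assert store.apply_plan(41, {"endpoint": ["10.0.0.9"]}) is False  # stale, rejected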

Okay, I concede that the power grid was a poor example, but clearly the internet is not. No one has pointed out a counter for the internet.

Some of the biggest failures have been BGP leaks/hijacks. E.g. https://www.ripe.net/about-us/news/youtube-hijacking-a-ripe-...

This has gotten significantly better in recent years, but it used to be possible and common for a single misbehaving AS to cause global issues.

The power grid absolutely can fail catastrophically and is a lot more fragile than people think.

Texas nearly ran into this during their blackout a few years ago -- their grid got within a few minutes of complete failure that would have required a black start which IIRC has never been done.

Grady has a good explanation and the writeup is interesting reading too.

https://youtu.be/08mwXICY4JM?si=Lmg_9UoDjQszRnMw

https://youtu.be/uOSnQM1Zu4w?si=-v6-Li7PhGHN64LB

The grid fails catastrophically. It happened this year in Portugal, Spain, and nearby countries? Still, think of the grid as more like DNS. It is immense, but the concept is simple and well understood. You can quickly identify where the fault is (even if not the actual root cause), and you can also quickly address it (even if bringing it back up in sync takes time and is not trivial). Current cloud infra is different in that each implementation is unique, services are unique, and knowledge is not universal. There are no books about AWS's infra fundamentals or how to manage AWS's cloud.

The power grid is a huge risk in several major western nations.

Also, aviation is a great example of how we can manage failures in complex systems and how we can track and fix more, and rarer, failures over time.

Great link, thanks for sharing. This point below stood out to me — put another way, “fixing” a system in response to an incident to make it safer might actually be making it less safe.

>>> Views of ‘cause’ limit the effectiveness of defenses against future events.

>>> Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.

But that sounds like an assertion without evidence and underestimates the competence of everyone involved in designing and maintaining these complex systems.

For example, take airline safety -- are we to believe, based on the quoted assertion, that every airline accident and the resulting remedies that mitigated its causes have made air travel LESS safe? That sounds objectively, demonstrably false.

Truly complex systems like ecosystems and climate might qualify for this assertion, where humans have interfered, often with the best of intentions, but caused unexpected effects that may be beyond human capacity to control.

Airline safety is a special case, I think — the NTSB does incredible work, and their recommendations are always designed to improve total safety, not just reduce the likelihood of a specific failure.

But I can think of lots of examples where the response to an unfortunate, but very rare, incident can make us less safe overall. The response to rare vaccine side effects comes immediately to mind.

I'll admit I didn't read all of either document, but I'm not convinced by the argument that one cannot attribute a failure to a root cause simply because the system is complex and required multiple points of failure to fail catastrophically.

One could make a similar argument in sports that no one person ever scores a point because they are only put into scoring position by a complex series of actions which preceded the actual point. I think that's technically true but practically useless. It's good to have a wide perspective of an issue but I see nothing wrong with identifying the crux of a failure like this one.

The best example for this is aviation. Insanely complex from the machines to the processes to the situations to the people, all interconnected and constantly interacting. But we still do "root cause" analyses and based on those findings try to improve every point in the system that failed or contributed to the failure, because that's how we get a safer aviation industry. It's definitely worked.

It's extremely useful in sports. We evaluate batters on OPS vs RBI, and no one ever evaluates them on runs they happened to score. We talk all the time about a QB working together with his linemen and receivers. If all we talked about was the immediate cause, we'd miss all that.

I'm not saying we ignore all other causes in sports analysis, I'm saying it doesn't make sense to pretend that there's no "one person" who hit the home run or scored a touchdown. Of course it's usually a team effort but we still attribute a score to one person.

Respectfully, I don't think that piece adds anything of material substance. It's a list of hollow platitudes (vapid writing listing inactionable truisms).

A better resource is likely Michael Nygard's book, "Release It!". It has practical advice about many issues in this outage. For example, it appears the circuit breaker and bulkhead patterns were underused here.

Excerpt: https://www.infoq.com/articles/release-it-five-am/
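
For anyone who hasn't met the pattern, here is a minimal circuit-breaker sketch (illustrative only, with names I made up; see the book for the real treatment): after a run of consecutive failures it stops calling the dependency and fails fast for a cooldown period, which keeps a struggling dependency from being hammered by retries.

  # Illustrative circuit breaker, not a library API: after max_failures
  # consecutive errors the breaker "opens" and calls fail fast until
  # reset_after seconds pass, then it lets one trial call through.
  import time

  class CircuitBreaker:
      def __init__(self, max_failures=5, reset_after=30.0):
          self.max_failures = max_failures
          self.reset_after = reset_after
          self.failures = 0
          self.opened_at = None

      def call(self, fn, *args, **kwargs):
          if self.opened_at is not None:
              if time.monotonic() - self.opened_at < self.reset_after:
                  raise RuntimeError("circuit open: failing fast")
              self.opened_at = None  # half-open: allow a trial call
          try:
              result = fn(*args, **kwargs)
          except Exception:
              self.failures += 1
              if self.failures >= self.max_failures:
                  self.opened_at = time.monotonic()
              raise
          self.failures = 0
          return result

A bulkhead is the complementary idea: give each dependency its own bounded pool of threads/connections so one slow dependency can't exhaust the resources shared with everything else.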

How does knowing this help you avoid these problems? It doesn’t seem to provide any guidance on what to do in the face of complex systems.

He's literally writing about Three Mile Island. He doesn't have anything to tell you about what concurrency primitives to use for your distributed DNS management system.

But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?

Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.
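
(To make "congestive collapse" concrete: one standard mitigation, which I'm not claiming is what DWFM actually does, is capped exponential backoff with jitter on the re-establishment work, so that when the dependency comes back the recovering fleet doesn't all retry at once and knock it over again. A sketch:)

  # Illustrative only, not DWFM internals: capped exponential backoff with
  # full jitter spreads retries from many workers across time, so recovery
  # traffic ramps up instead of arriving as a thundering herd.
  import random
  import time

  def retry_with_backoff(op, base=0.5, cap=60.0, max_attempts=10):
      for attempt in range(max_attempts):
          try:
              return op()
          except Exception:
              if attempt == max_attempts - 1:
                  raise
              time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))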

It's entirely unclear to me if a system the size and scope of AWS could be re-thought using these principles and successfully execute a complete restructuring of all their processes to reduce their failure rate a bit. It's a system that grew over time with many thousands of different developers, with a need to solve critical scaling issues that would have stopped the business in its tracks (far worse than this outage).

Another point is that DWFM is likely working in a privileged, isolated network because it needs access deep into the core control plane. After all, you don't want a rogue service to be able to add a malicious agent to a customer's VPC.

And since this network is privileged, observability tools, debugging support, and maybe even access to it are more complicated. Even just the set of engineers who have access is likely more limited, especially at 2AM.

Should AWS relax these controls to make recovery easier? But then it will also result in a less secure system. It's again a trade-off.

Both documents are "ceremonies for engineering personalities."

Even you can't help it - "enumerating a list of questions" is a very engineering thing to do.

Normal people don't talk or think like that. The way Cook is asking us to "think about problems" is kind of the opposite of what good leadership looks like. Thinking about thinking about problems is like, 200% wrong. On the contrary, be way more emotional and way simpler.

I don’t really follow what you are suggesting. If the system is complex and constantly evolving, as the article states, you aren’t going to be able to close any expertise/process gap. Operating in a degraded state is probably already built in; this was just a state of degradation they were not prepared for. You can’t figure out all the degraded states to operate in, because by definition the system is complex.

Reading through this reminds me a lot of the book "Engineering a Safer World", which opens by talking about some of the largest catastrophes (ferry sinkings, chemical plant leaks, etc.) and how they went wrong in the framework of systemic thinking. I haven't finished it yet, but even the first part has made me dislike the concept of "root causes"; it's more like "emergent behavior".

Thanks, I'm one of the lucky 10,000 today.

This particular piece has been shared near me several times, in the context of this recent AWS outage, the previous big AWS outage, non-AWS outages, and others. Every time, I feel like I'm in vague agreement with the author, and at the same time, none of it is the least bit actionable. Even if Cook is correct, so what? There's no concrete change I can make in how I work.

Nobody discussing the problem understands it.

[flagged]

I picked a random bullet point to read (9) and I'm pretty sure it's complete nonsense. That's not an example of defensiveness leading to new problems.

Doing this isn't helpful.

Edit: Oops I looked at 8, it's also wrong, the Enactor setting the plan wasn't locally rational, it made a clear mistake. Also that claim has nothing to do with the rest of the paragraph! This output is so bad.

That was a waste of my time.

And I strongly recommend that you stop recommending the reading of something whose practical usefulness is limited by what the treatise leaves unsaid:

  – It identifies problems (complexity, latent failures, hindsight bias, etc.) more than it offers solutions. Readers must seek outside methods to act on these insights.

  – It feels abstract, describing general truths applicable to many domains, but requiring translation into domain-specific practices (be it software, aviation, medicine, etc.).

  – It leaves out discussion on managing complexity – e.g. principles of simplification, modular design, or quantitative risk assessment – which would help prevent some of the failures it warns about.

  – It assumes well-intentioned actors and does not grapple with scenarios where business or political pressures undermine safety – an increasingly pertinent issue in modern industries.

  – It does not explicitly warn against misusing its principles (e.g. becoming fatalistic or overconfident in defenses). The nuance that «failures are inevitable but we still must diligently work to minimize them» must come from the reader’s interpretation.

«How Complex Systems Fail» is highly valuable for its conceptual clarity and timeless truths about complex system behavior. Its direction is one of realism – accepting that no complex system is ever 100% safe – and of placing trust in human skill and systemic defenses over simplistic fixes. The rational critique is that this direction, whilst insightful, needs to be paired with concrete strategies and a proactive mindset to be practically useful.

The treatise by itself won’t tell you how to design the next aircraft or run a data center more safely, but it will shape your thinking so you avoid common pitfalls (such as chasing singular root causes or blaming operators). To truly «preclude» failures or mitigate them, one must extend Cook’s ideas with detailed engineering and organizational practices. In other words, Cook teaches us why things fail in complex ways; it is up to us – engineers, managers, regulators, and front-line practitioners – to apply those lessons in how we build and operate the systems under our care.

To be fair, at the time of writing (the late 1990s), Cook’s treatise was breaking ground by succinctly articulating these concepts for a broad audience. Its objective was likely to provoke thought and shift paradigms rather than to serve as a handbook.

Today, we have the benefit of two more decades of research and practice in resilience engineering, which builds on Cook’s points. Practitioners now emphasise building resilient systems, not just trying to prevent failure outright. They use Cook’s insights as rationale for things such as chaos engineering, better incident response, and continuous learning cultures.