Hacker News

> Most obviously, RCA has an infinite regress problem

Root cause analysis is just like any other tool. Failure to precisely define the nature of the problem is what usually turns RCA into a wild goose chase. Consider the following problem statements:

"The system feels yucky to use. I don't like it >:("

"POST /reports?option=A is slow around 3pm on Tuesdays"

One of these is more likely to produce a useful RCA that proceeds and halts in a reasonable way.

"AWS went down"

That is not a good starting point for a useful RCA session. "AWS" and "down" are the most volatile terms in this context. What parts of AWS? To what extent were they down? Is the outage intrinsic to each service or due to external factors like DNS?

"EC2 instances in region X became inaccessible to public internet users between Y & Z"

This is the level of granularity I would build my PPTX around if I were working at AWS. You can determine that there was a common thread after the fact and put it in your conclusion / next-steps slide. Starting hyper-specific means that you are less likely to get distracted and can arrive at a good answer much faster. By aggregating the conclusions of many such reports, you could then prioritize a strategy for preventing similar outages in the future.
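To make the aggregation step concrete, here is a rough sketch of what I mean, not a real tool. The `Report` type, the `prioritize` function, and every value in the sample list are made-up placeholders; the only idea is that each narrowly scoped RCA halts on one conclusion, and counting how often a conclusion recurs gives you a first-pass priority order.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    # Hypothetical fields for illustration only, not real incident data.
    scope: str       # the narrow problem statement the RCA started from
    root_cause: str  # the conclusion that RCA halted on


def prioritize(reports):
    """Rank root causes by how many narrow reports concluded with them."""
    counts = Counter(r.root_cause for r in reports)
    # most_common() sorts by count, highest first.
    return counts.most_common()


# Placeholder reports; "cause-1" and "cause-2" stand in for real findings.
reports = [
    Report("service A unreachable in region X", "cause-1"),
    Report("service B unreachable in region X", "cause-1"),
    Report("service C slow in region X", "cause-2"),
]

ranked = prioritize(reports)
print(ranked)
```

Raw frequency is only one possible weighting. You might reasonably weight by customer impact or duration instead, but the point stands that the common thread falls out of the narrow reports rather than being the starting premise.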