I'm much less confident in that assertion. I'm not bullish on AI systems independently taking over operations from humans, but catastrophic outages are combinations of less-catastrophic outages which are themselves combinations of latent failures, and when the latent failures are easy to characterize (as is the case here!), LLMs actually do really interesting stuff working out the combinatorics.
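
To make the combinatorics point concrete, here's a tiny sketch (the catalog and names are made up for illustration, not anyone's real failure model): take a set of already-characterized latent failures and enumerate small combinations of them as candidate compound-failure scenarios, which is exactly the kind of structured search you could then hand to a model to reason about.

```python
from itertools import combinations
from dataclasses import dataclass

@dataclass(frozen=True)
class LatentFailure:
    name: str
    subsystem: str
    note: str

# Hypothetical catalog; in practice this would come from your failure model.
CATALOG = [
    LatentFailure("stale-dns-cache", "dns", "resolvers serve stale records under churn"),
    LatentFailure("retry-storm", "clients", "aggressive retries amplify load during recovery"),
    LatentFailure("leader-flap", "control-plane", "leadership flaps under partial partition"),
]

def candidate_scenarios(catalog, max_size=3):
    """Yield every combination of latent failures, from pairs up to max_size."""
    for k in range(2, max_size + 1):
        yield from combinations(catalog, k)

for scenario in candidate_scenarios(CATALOG):
    print(" + ".join(f.name for f in scenario))
```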

I wouldn't want to, like, make a company out of it (I assume the foundation model companies will eat all these businesses), but you could probably do some really interesting stuff with an agent that consumes telemetry and failure-model information and uses it to surface hypotheses about what to look at or what interventions to consider.
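
As a sketch of what I mean (the `llm()` call is just a placeholder, not any particular provider's API, and the prompt and response shape are made up): the agent doesn't act on anything, it just folds telemetry and the failure model into a request for ranked hypotheses and candidate interventions.

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for whatever completion API you actually use."""
    raise NotImplementedError

def surface_hypotheses(telemetry: dict, failure_model: dict, max_hypotheses: int = 5) -> list:
    prompt = (
        "You are assisting an on-call engineer.\n"
        f"Given the telemetry snapshot and known failure model below, list up to {max_hypotheses} "
        "hypotheses about what might be going wrong. For each, say which signals to check next "
        "and which interventions are worth considering. Reply as a JSON list of objects with "
        "keys: hypothesis, check_next, interventions.\n\n"
        f"TELEMETRY:\n{json.dumps(telemetry, indent=2)}\n\n"
        f"FAILURE MODEL:\n{json.dumps(failure_model, indent=2)}\n"
    )
    return json.loads(llm(prompt))
```

The point of the shape is that it stays human-in-the-loop: it surfaces things to check and consider, and an engineer decides what to do.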

All of this is beside my original point, though: I'm saying you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!