As I was reading through that list, I kept wondering why it didn't feel universally true.
Then I realized: the internet and the power grid (at least in most developed countries) are extremely complex and not always built by efficient organizations, yet they don't actually fail catastrophically. What's the retort to this argument?
They do fail catastrophically. E.g. https://en.wikipedia.org/wiki/Northeast_blackout_of_2003
I think you could argue AWS is more complex than the electrical grid, but even if it's not, the grid has had several decades to iron out kinks and AWS hasn't. AWS also adds a ton of completely new services each year on top of adding more capacity. E.g. I bet these DNS Enactors have become more numerous, and their plans much larger, than when they were first developed, which has greatly increased the odds of hitting this issue.
Okay, I concede that the power grid was a poor example, but clearly the internet is not. No one has pointed out a counterexample for the internet.
Some of the biggest failures have been BGP leaks/hijacks. E.g. https://www.ripe.net/about-us/news/youtube-hijacking-a-ripe-...
This has gotten significantly better in recent years, but it used to be possible and common for a single misbehaving AS to cause global issues.
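For anyone wondering why one AS can do that much damage: routers prefer the most specific matching prefix, so a more-specific announcement for someone else's address space wins wherever it propagates. A minimal sketch of that selection rule, with made-up prefixes and AS numbers (real BGP also weighs path length and local policy, and `best_route` is just a name I picked for illustration):

```python
import ipaddress

def best_route(table, destination):
    """Pick the entry whose prefix contains destination and is most specific."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in table if addr in net]
    if not matches:
        return None
    # Longest prefix wins, regardless of who announced it.
    return max(matches, key=lambda entry: entry[0].prefixlen)

# Hypothetical prefixes and documentation-range AS numbers, for illustration only.
LEGIT_AS = 64496
ROGUE_AS = 64497

table = [(ipaddress.ip_network("10.0.0.0/22"), LEGIT_AS)]
before = best_route(table, "10.0.1.5")

# A misbehaving AS announces a more-specific /24 inside the legitimate /22.
table.append((ipaddress.ip_network("10.0.1.0/24"), ROGUE_AS))
after = best_route(table, "10.0.1.5")
unaffected = best_route(table, "10.0.2.5")

print(before[1], after[1], unaffected[1])
```

Traffic for 10.0.1.5 shifts to the rogue origin as soon as the /24 shows up, while addresses outside it (10.0.2.5) still follow the legitimate /22.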
The power grid absolutely can fail catastrophically and is a lot more fragile than people think.
Texas nearly ran into this during its blackout a few years ago: the grid got within a few minutes of a complete failure that would have required a black start, which IIRC has never had to be done there.
Grady has a good explanation and the writeup is interesting reading too.
https://youtu.be/08mwXICY4JM?si=Lmg_9UoDjQszRnMw
https://youtu.be/uOSnQM1Zu4w?si=-v6-Li7PhGHN64LB
The grid does fail catastrophically. It happened this year in Portugal, Spain, and nearby countries. Still, think of the grid as more like DNS: it is immense, but the concept is simple and well understood. You can quickly identify where the fault is (even if not the actual root cause) and quickly address it (even if bringing everything back up in sync takes time and is not trivial). Current cloud infra is different in that each implementation is unique, the services are unique, and the knowledge is not universal. There are no books about AWS's infra fundamentals or how to manage AWS's cloud.
> the internet
https://www.kentik.com/blog/a-brief-history-of-the-internets...
> power grid
https://www.entsoe.eu/publications/blackout/28-april-2025-ib...
The power grid is a huge risk in several major western nations.
Also, aviation is a great example of how we can manage failures in complex systems, and of how we can track down and fix progressively rarer failures over time.