> You cannot pre-emptively predict all possible bugs and race conditions. How can I create alerts for all of the failure conditions I have not thought of?
You can’t. That’s just life. The electrical and building codes didn’t start as thousand-page tomes, but as we gained experience over the course of countless incidents, the industry recorded those lessons as prescriptions. Every rule was written in blood, as they say, and now practitioners are bound to follow them. We don’t have the same regulatory framework to ensure we build resilient services, but on the other hand, nobody has died or been seriously injured as a consequence of an internet service failure.
> A better assumption is that all systems will fail, and one of the things you can do to reduce failure rate is to simplify the system.
Why not do both? That said, some systems have irreducible complexity for good reason, and it is better to check whether that is actually the case before offering armchair prescriptions.
> Additionally, you can segment the system into shards/cells and roll out config and code changes serially to each cell to catch issues before they affect 100% of customers.
I was formerly the lead of the AWS Well-Architected reliability pillar. You’re describing an AWS design and operating principle, and many services do just that (I’m not sure about DynamoDB, but it would surprise me if it didn’t). However, at the end of the day, there is still a single regional service endpoint that customers use.
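
For anyone following along, here is a rough sketch of what "roll out serially to each cell" means in practice. It is illustrative only and not how any AWS service actually implements it: the cell names and the `apply_change`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins for real deployment tooling.

```python
from typing import Callable, List, Optional, Tuple


def rollout_serially(
    cells: List[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change one cell at a time and halt at the first unhealthy cell.

    Returns the cells that were updated successfully and the cell that
    failed (or None if every cell was updated).
    """
    completed: List[str] = []
    for cell in cells:
        apply_change(cell)
        if not is_healthy(cell):
            # Undo the change in the failing cell and stop; later cells are never touched.
            rollback(cell)
            return completed, cell
        completed.append(cell)
    return completed, None


# Hypothetical demo: the change breaks "cell-c", so "cell-d" is never updated.
deployed = set()


def apply_change(cell: str) -> None:
    deployed.add(cell)


def rollback(cell: str) -> None:
    deployed.discard(cell)


def is_healthy(cell: str) -> bool:
    return cell != "cell-c"


done, failed = rollout_serially(
    ["cell-a", "cell-b", "cell-c", "cell-d"], apply_change, is_healthy, rollback
)
print(done, failed)  # ['cell-a', 'cell-b'] cell-c
```

This limits the blast radius of a bad change to one cell, but it does nothing for whatever sits in front of all the cells, which is the point about the single regional endpoint.
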
> I am not hand waving or yelling at the clouds here. I have worked on service discovery for hyperscalers and have witnessed similar outages where the impedance mismatch between internal service discovery and DNS causes issues.
Nobody is accusing you of such behavior, but you also haven’t proposed a concretely better solution, and the one you have mentioned in other replies (Envoy xDS) isn’t built for this purpose. It might work fine in the context of a Kubernetes cluster, but it’s certainly not appropriate for Internet-scale service discovery or the planetary-scale edge service fabric that fly.io is building.