My takeaway was that the race condition was the root cause. Take away that bug, and suddenly there's no incident, regardless of any processing delays.

Right. Sounds like it’s a case of “rolling your own distributed system algorithm” without the up-front investment in implementing a truly robust distributed system.

Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.

Consider this a nudge for all you engineers who are designing fault-tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve which problems.

> some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates

Reading these words makes me break out in a cold sweat :-) I really hope they don't.

Certainly seems like misuse of DNS. It wasn't designed to be a rapidly updatable consistent distributed database.

That's true, if you use the CAP definition for consistency. Otherwise, I'd say that the DNS design satisfies each of those terms:

- "Rapidly updatable" depends on the specific implementation, but the design allows for 2 billion changesets in flight before mirrors fall irreparably out of sync with the master database, and the DNS specs include all components necessary for rapid updates: push-based notifications and incremental transfers.

- DNS is designed to be eventually consistent, and each replica is expected to always offer internally consistent data. It's certainly possible for two mirrors to respond with different responses to the same query, but eventual consistency does not preclude that.

- Distributed: DNS certainly is a distributed database; in fact, it was specifically designed to allow for replication across organization boundaries -- something that very few other distributed systems offer. What DNS does not offer is multi-master operation, but neither do e.g. Postgres or MSSQL.
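
For anyone wondering where the "2 billion" figure comes from: as far as I understand it, the zone's SOA serial is a 32-bit counter compared with wrap-around serial number arithmetic (RFC 1982), so a mirror can only tell "newer" from "older" while it is fewer than 2^31 increments behind. A minimal sketch of that comparison, with helper names that are my own and not taken from any DNS server:

```python
# Sketch of RFC 1982 serial number arithmetic for 32-bit SOA serials.
# Function names are hypothetical; real DNS servers have their own helpers.

SERIAL_BITS = 32
MODULUS = 1 << SERIAL_BITS        # serials wrap at 2**32
HALF = 1 << (SERIAL_BITS - 1)     # 2**31, the comparison horizon


def serial_add(serial: int, n: int) -> int:
    """Advance a serial by n; a single step may be at most 2**31 - 1."""
    if not 0 <= n <= HALF - 1:
        raise ValueError("increment must be in [0, 2**31 - 1]")
    return (serial + n) % MODULUS


def serial_is_newer(candidate: int, current: int) -> bool:
    """Return True if `candidate` counts as newer than `current`."""
    if candidate == current:
        return False
    diff = (candidate - current) % MODULUS
    if diff == HALF:
        # Exactly half the number space apart: the RFC leaves this undefined.
        raise ValueError("serials are 2**31 apart; ordering is undefined")
    return diff < HALF


if __name__ == "__main__":
    # Wrap-around: 5 is newer than 4294967290 (11 increments later).
    print(serial_is_newer(5, 4294967290))
    # A mirror more than 2**31 increments behind sees its own stale
    # serial as "newer" than the master's and will not transfer.
    stale, master = 100, (100 + HALF + 1) % MODULUS
    print(serial_is_newer(stale, master))
```

Both calls in the demo return `True`; the second one is the "irreparably out of sync" case, where the stale mirror believes it is ahead.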

I think historically DNS was “best effort”, but with consensus algorithms like Raft, I can imagine a DNS that is perfectly consistent.

Further, please don’t stop at Raft. Raft is popular because it is easy to understand, not because it is the best way to do distributed consensus. It is non-deterministic (thus requiring odd numbers of electors), requires timeouts for liveness (thus latency can kill you), and isn’t all that good for general-purpose consensus, IMHO.
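
On the odd-numbers point, the usual arithmetic behind it is that majority-quorum protocols such as Raft need more than half the members to agree, so an even-sized cluster tolerates no more failures than the next smaller odd one. Odd sizes are therefore the common recommendation rather than a hard requirement, as far as I know. A quick sketch (function names are mine, not from any Raft library):

```python
# Majority-quorum arithmetic; names are illustrative only.

def majority(n: int) -> int:
    """Smallest number of votes that is more than half of n members."""
    if n < 1:
        raise ValueError("cluster needs at least one member")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return n - majority(n)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, majority(n), tolerated_failures(n))
```

Going from 3 to 4 members raises the quorum from 2 to 3 while the tolerated failures stay at 1, which is why the extra even member buys nothing.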