So the DNS records' if-stale-then-needs-update logic was basically a variation of one of the "2 Hard Things In Computer Science" - cache invalidation. Excerpt from the giant paragraph:

>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]

It outlines some of the mechanics, but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? A human-error misconfiguration causing unintended delays in Enactor behavior?!? Either the sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made the Enactor behave in such an unpredictable way.

This is public messaging to explain the problem at large. It isn't really a post-incident analysis.

Before the active incident is “resolved” there's an evaluation of probable/plausible recurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well, to react quickly to any recurrence. Any likely open risks are actively worked to mitigate before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.

Next, any plausible paths to “risk of recurrence” would be the top dev team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue-depth problems, or other similar “near miss” findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.

Then, in the next few weeks, at org- and AWS-level “ops meetings” there are going to be in-depth discussions of the incident, the response, the underlying problems, etc. The goal there is organizational learning and broader dissemination of lessons learned, action items, best practices, etc.

> ...there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?

Can't speak for the current incident but a similar "slow machine" issue once bit our BigCloud service (not as big an incident, thankfully) due to loooong JVM GC pauses on failing hardware.

My takeaway was that the race condition was the root cause. Take away that bug, and suddenly there's no incident, regardless of any processing delays.
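To make that concrete, here's a toy of the check-then-act shape described in the excerpt (all names hypothetical, nothing to do with AWS's actual code): the freshness check and the write are separate steps, so a delay between them is all it takes for an older plan to clobber a newer one.

```python
import threading
import time
from dataclasses import dataclass

@dataclass
class Plan:
    generation: int
    records: dict

class Endpoint:
    """Toy DNS endpoint tracking the generation of the last applied plan."""
    def __init__(self):
        self.applied_generation = 0
        self.records = {}

    def apply(self, plan: Plan, delay: float = 0.0) -> None:
        # Check-then-act: the freshness check and the write are separate steps,
        # so a long delay in between lets a stale plan clobber a newer one.
        if plan.generation > self.applied_generation:   # check (can go stale)
            time.sleep(delay)                            # the "unusually high delays"
            self.records = dict(plan.records)            # old plan overwrites new
            self.applied_generation = plan.generation

ep = Endpoint()
old_plan = Plan(generation=1, records={"api": "10.0.0.1"})
new_plan = Plan(generation=2, records={"api": "10.0.0.2"})

slow_enactor = threading.Thread(target=ep.apply, args=(old_plan, 0.1))
slow_enactor.start()
time.sleep(0.02)          # let the slow enactor get past its freshness check
ep.apply(new_plan)        # a second enactor applies the newer plan quickly...
slow_enactor.join()
print(ep.applied_generation)  # ...then the delayed old plan lands: typically prints 1
```

Making the comparison atomic with the write (a lock, or a conditional write in whatever store holds the records) would remove the incident even if the delays still happened.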

Right. Sounds like it’s a case of “rolling your own distributed system algorithm” without the up-front investment in implementing a truly robust distributed system.

Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.

Consider this a nudge for all you engineers that are designing fault tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve what problems.

> some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates

Reading these words makes me break out in a cold sweat :-) I really hope they don't.

Certainly seems like misuse of DNS. It wasn't designed to be a rapidly updatable consistent distributed database.

That's true, if you use the CAP definition of consistency. Otherwise, I'd say that the DNS design satisfies each of those terms:

- "Rapidly updatable" depends on the specific implementation, but the design allows for 2 billion changesets in flight before mirrors fall irreparably out of sync with the master database, and the DNS specs include all components necessary for rapid updates: push-based notifications and incremental transfers.

- DNS is designed to be eventually consistent, and each replica is expected to always offer internally consistent data. It's certainly possible for two mirrors to give different responses to the same query, but eventual consistency does not preclude that.

- Distributed: the DNS system certainly is a distributed database; in fact, it was specifically designed to allow for replication across organization boundaries -- something that very few other distributed systems offer. What DNS does not offer is multi-master operation, but neither do e.g. Postgres or MSSQL.
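Where the "2 billion changesets" figure comes from, as I read the specs: zone serials are 32-bit counters compared with RFC 1982 sequence-space arithmetic, so a mirror still recognizes the master as ahead as long as the gap stays under 2^31, even across wraparound. A rough sketch of that comparison:

```python
SERIAL_BITS = 32
HALF = 2 ** (SERIAL_BITS - 1)  # 2^31, roughly the "2 billion" budget

def serial_gt(a: int, b: int) -> bool:
    """RFC 1982 sequence-space comparison: is serial a 'newer than' serial b?

    True when a is ahead of b by less than 2^31, even if the 32-bit counter
    wrapped around in between.
    """
    a, b = a % 2**SERIAL_BITS, b % 2**SERIAL_BITS
    return a != b and (a - b) % 2**SERIAL_BITS < HALF

# A mirror at serial 10 still sees a master at ~2 billion as newer...
assert serial_gt(2_000_000_009, 10)
# ...but at a gap of 2^31 the ordering breaks down and the mirror can no
# longer tell it has fallen behind (time for a full zone transfer).
assert not serial_gt(10 + 2**31, 10)
print("serial arithmetic checks pass")
```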

I think historically DNS was “best effort”, but with consensus algorithms like Raft I can imagine a DNS that is perfectly consistent.

Further, please don’t stop at Raft. Raft is popular because it is easy to understand, not because it is the best way to do distributed consensus. It is non-deterministic (thus requiring odd numbers of electors), requires timeouts for liveness (so latency can kill you), and isn’t all that good for general-purpose consensus, IMHO.

Why is the "DNS Planner" and "DNS Enactor" separate? If it was one thing, wouldn't this race condition have been much more clear to the people working on it? Is this caused by the explosion of complexity due to the over use of the microservice architecture?

> Why are the "DNS Planner" and "DNS Enactor" separate?

For a large system it's very nice, in practice, to split things up like that: you have one bit of software that just reads a bunch of data and then emits a plan, and another thing that just gets given a plan and executes it.

This is easier to test (you're just dealing with producing one data structure and consuming one data structure; the planner doesn't even try to mutate anything), it's easier to restrict permissions (one side only needs read access to the world!), it's easier to do upgrades (neither side depends on the other existing or even being in the same language), it's safer to operate (the planner is disposable; it can crash or be killed at any time with no problem except update latency), it's easier to comprehend (humans can examine the planner output, which contains the entire state of the plan), it's easier to recover from weird states (you can, in extremis, hack the plan), etc. These are all things you appreciate more and more as your system gets bigger and more complicated.
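A toy of that shape, just to make the split concrete (hypothetical names, not AWS's actual design): the planner is a pure function from an observed view of the world to an immutable plan, and the enactor only knows how to apply a plan handed to it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    generation: int
    records: dict  # e.g. {"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]}

def make_plan(generation: int, healthy_hosts: dict) -> Plan:
    """Planner: read-only view of the world in, immutable plan out.

    A pure function: trivial to unit test, needs no write permissions,
    and can crash or be restarted at any time with nothing half-done.
    """
    desired = {name: sorted(hosts) for name, hosts in healthy_hosts.items() if hosts}
    return Plan(generation=generation, records=desired)

def enact(plan: Plan, dns_client) -> None:
    """Enactor: applies a plan it did not compute.

    Because the plan is plain data, humans can inspect it, diff it against the
    previous one, or (in extremis) hand-edit it before it gets applied.
    """
    for name, hosts in plan.records.items():
        dns_client.upsert(name, hosts)

class PrintingDNSClient:
    """Stand-in client; a real one would push records to the DNS endpoints."""
    def upsert(self, name, hosts):
        print(f"upsert {name} -> {hosts}")

p = make_plan(generation=42, healthy_hosts={"dynamodb.us-east-1": ["10.0.0.2", "10.0.0.1"]})
enact(p, PrintingDNSClient())   # upsert dynamodb.us-east-1 -> ['10.0.0.1', '10.0.0.2']
```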

> If they were one thing, wouldn't this race condition have been much clearer to the people working on it?

no

> Is this caused by the explosion of complexity due to the overuse of microservice architecture?

no

It's extremely easy to second-guess the way other people decompose their services, since randoms online can't see any of the actual complexity or any of the details and so can easily suggest it would be better if it were different, without having to worry about any of the downsides of the imagined alternative solution.

Agreed, this is a common division of labor and simplifies things. It's not entirely clear in the postmortem but I speculate that the conflation of duties (i.e. the enactor also being responsible for janitor duty of stale plans) might have been a contributing factor.

The Oxide and Friends folks covered an update system they built that is similarly split and they cite a number of the same benefits as you: https://oxide-and-friends.transistor.fm/episodes/systems-sof...

I would divide these as functions inside a monolithic executable. At most, emit the plan to a file on disk as a "--whatif" optional path.

Distributed systems with files as a communication medium are much more complex than programmers think with far more failure modes than they can imagine.

Like… this one, that took out a cloud for hours!

Doing it inside a single binary gets rid of some of the nice observability features you get "for free" by breaking it up, and could complicate things quite a bit (more code paths, a flag for a "don't make a plan, use the last plan" mode, a flag for a "use this human-generated plan" mode). Very few things are a free lunch, but I've used this pattern numerous times and quite like it. I ran a system that used a MIP model to do capacity planning, and separating planning from executing the plan was very useful for us.

I think the communications piece depends on what other systems you have around you to build on; it's unlikely this planner/executor is completely freestanding. Some companies have large distributed filesystems with well-known, tested semantics, and schedulers that launch jobs when files appear; they might have ~free access to a database with strict serializability where they can store a serialized version of the plan, etc.
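If the plan lives in a store like that, the stale-overwrite problem can also be handled at the storage layer with a conditional write; a minimal sketch using sqlite as a stand-in (hypothetical schema, not what AWS actually does):

```python
import json
import sqlite3

# Tiny stand-in for "a database with strict serializability" holding the
# current plan per scope; the generation column is what makes stale writes lose.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE plans (scope TEXT PRIMARY KEY, generation INTEGER, body TEXT)")
db.execute("INSERT INTO plans VALUES ('us-east-1', 0, '{}')")
db.commit()

def store_plan(scope: str, generation: int, body: dict) -> bool:
    """Conditional write: only replaces the stored plan if ours is newer.

    The freshness check and the write happen in one statement, so an enactor
    that was delayed for minutes cannot overwrite a newer plan.
    """
    cur = db.execute(
        "UPDATE plans SET generation = ?, body = ? "
        "WHERE scope = ? AND generation < ?",
        (generation, json.dumps(body), scope, generation),
    )
    db.commit()
    return cur.rowcount == 1   # 0 rows updated means our plan was stale

print(store_plan("us-east-1", 2, {"api": "10.0.0.2"}))  # True: newer plan accepted
print(store_plan("us-east-1", 1, {"api": "10.0.0.1"}))  # False: stale plan rejected
```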

I mean, any time a service goes down, even one 1/100th the size of AWS, you have people crawling out of the woodwork giving armchair advice while having no domain-relevant experience. It's barely even worth taking the time to respond. The people with opinions of value are already giving them internally.

> The people with opinions of value are already giving them internally.

Interesting take, in light of all the brain drain that AWS has experienced over the last few years. Some outside opinions might be useful - but perhaps the brain drain is so extreme that those remaining don't realize it's occurring?

Pick your battle, I'd guess. Given how huge AWS is, if you have desired state vs. a reconciler, you probably have more resilient operations generally and an easier job of finding and isolating problems; the flip side of that is, if you screw up your error handling, you get this. That aside, it seems strange to me that they didn't account for the fact that a stale plan could get picked up over a new one, so maybe I misunderstand the incident/architecture.

It probably was a single-threaded python script until somebody found a way to get a Promo out of it.

This is Amazon we’re talking about, it was probably Perl.

This was my thought also. The first sentences of the RCA screamed “race condition” without even having to mention the phrase.

The two DNS components effectively form a monolith: neither is useful without the other, and there is one arrow on the design coupling them together.

If they were a single component then none of this would have happened.

Also, version checks? Really?

Why not compare the current state against the desired state and take the necessary actions to bring them in line (see the sketch below)?

Last but not least, deleting old config files so aggressively is a “penny wise, pound foolish” design. I would keep these forever, or at least a month! Certainly much, much longer than any possible time taken going through the sequence of provisioning steps.
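A minimal sketch of that reconcile-style approach (hypothetical names and interface): diff the observed records against the desired ones on every pass, apply only the difference, and make "delete everything" an unreachable action.

```python
class FakeDNSClient:
    """In-memory stand-in for a real DNS API, just for this sketch."""
    def __init__(self, records: dict):
        self.records = dict(records)
    def upsert(self, name, hosts):
        self.records[name] = list(hosts)
    def delete(self, name):
        self.records.pop(name, None)

def reconcile(desired: dict, observed: dict, dns_client) -> None:
    """Level-triggered reconciliation: emit only the changes needed to make
    observed match desired, rather than overwriting everything wholesale."""
    if not desired:
        # Guard rail: an empty desired state is far more likely to be an
        # upstream bug than a real intent to remove every record.
        raise RuntimeError("refusing to reconcile toward an empty record set")
    for name, hosts in desired.items():
        if observed.get(name) != hosts:
            dns_client.upsert(name, hosts)      # create missing / fix drifted records
    for name in observed.keys() - desired.keys():
        dns_client.delete(name)                 # prune records no longer desired

client = FakeDNSClient({"old-endpoint": ["10.0.0.9"]})
reconcile(
    desired={"dynamodb.us-east-1": ["10.0.0.1", "10.0.0.2"]},
    observed=client.records.copy(),
    dns_client=client,
)
print(client.records)   # {'dynamodb.us-east-1': ['10.0.0.1', '10.0.0.2']}
```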

Yes it should be impossible for all DNS entries to get deleted like that.

Also, I don't know if I missed it, but they don't establish anything to prevent an outage if there's an unusually high delay again?

It’s at the end: they disabled the DDB DNS automation around this, to be fixed before they re-enable it.

If it's re-enabled (without changes?), wouldn't an unusually high delay break it again?

Why would they enable it without fixing the issue?

The post-mortem is specific that they won't turn it back on without resolving this, but I feel like the default assumption for any halfway competent entity would be that they fix the known issue that got the automation disabled in the first place.