Interesting use of the phrase “Route53 transaction” for an operation that has no hard transactional guarantees. Especially given that the lack of transactional updates is what caused the outage…

I think you misunderstand the failure case. ChangeResourceRecordSets is transactional (or was when I worked on the service): https://docs.aws.amazon.com/Route53/latest/APIReference/API_....
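
For reference, a single ChangeResourceRecordSets call takes a ChangeBatch and the whole batch either lands or it doesn't. Roughly, in boto3 (the zone ID and record values below are placeholders, not anything from the incident):

    import boto3

    route53 = boto3.client("route53")

    # Both changes ride in one ChangeBatch and are applied atomically:
    # either the whole batch takes effect or none of it does.
    route53.change_resource_record_sets(
        HostedZoneId="Z0EXAMPLE",  # placeholder hosted zone
        ChangeBatch={
            "Comment": "swap endpoints as a single atomic change",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "api.example.com",
                        "Type": "A",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "192.0.2.10"}],
                    },
                },
                {
                    "Action": "DELETE",
                    "ResourceRecordSet": {
                        "Name": "old-api.example.com",
                        "Type": "A",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "192.0.2.20"}],
                    },
                },
            ],
        },
    )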

The fault was two different clients with divergent goal states:

- one ("old") DNS Enactor experienced unusually high delays and needed to retry its update on several of the DNS endpoints

- the DNS Planner continued to run and produced many newer generations of plans [Ed: this is key: it's producing "plans" of desired state; a plan does not include a complete transaction like a log or chain with previous state + mutations]

- one of the other ("new") DNS Enactors then began applying one of the newer plans

- then ("new") invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them [Ed: the key race is implied here. The "old" Enactor is reading _current state_, which was the output of "new", and applying its desired "old" state on top. The discrepancy is because apparently the Planner and Enactor aren't working with a chain, vector clock, serialized change set numbers, etc.; there's a sketch of the race after this list]

- At the same time the first ("old") Enactor ... applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [Ed: and here is where "old" Enactor creates the valid ChangeRRSets call, replacing "new" with "old"]

- The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time [Ed: Whoops!]

- The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.
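
To make the race concrete, here's a minimal sketch (my own names and structure, not the actual Planner/Enactor code): the freshness check runs at the start of plan application, but the apply itself is unconditional, so an Enactor that stalls between the two steps can overwrite a newer plan.

    import threading
    import time

    # Shared "currently applied plan" state, standing in for a DNS endpoint's config.
    # Nothing re-checks the generation at write time; that's the gap being sketched.
    state = {"applied_generation": 0}
    lock = threading.Lock()  # protects the dict, not the whole check-then-apply sequence

    def enactor_apply(name, plan_generation, delay_before_apply=0.0):
        # 1) Freshness check at the *start* of plan application.
        with lock:
            if plan_generation <= state["applied_generation"]:
                print(f"{name}: plan gen {plan_generation} is stale, skipping")
                return
        # 2) Time passes (retries, slow endpoints). The check above goes stale.
        time.sleep(delay_before_apply)
        # 3) Unconditional apply: last writer wins, regardless of generation.
        with lock:
            state["applied_generation"] = plan_generation
            print(f"{name}: applied plan gen {plan_generation}")

    # The "old" Enactor passes its check first but is delayed before applying gen 3;
    # the "new" Enactor applies gen 9 in the meantime.
    old = threading.Thread(target=enactor_apply, args=("old-enactor", 3, 1.0))
    new = threading.Thread(target=enactor_apply, args=("new-enactor", 9, 0.0))
    old.start()
    time.sleep(0.2)  # give "old" time to pass its freshness check
    new.start()
    old.join(); new.join()

    print("final applied generation:", state["applied_generation"])  # prints 3: the older plan won

And then the clean-up pass deletes the gen-3 plan because it's many generations older than gen 9, even though gen 3 is what's actually live on the endpoint.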

Ironically, Route 53 does have strong transactions of API changes _and_ serializes them _and_ has closed-loop observers to validate change sets globally on every dataplane host. So do other AWS services. And there are even some internal primitives for building replication or change set chains like this. But it's also a PITA and takes a bunch of work, and when it _does_ fail you end up with global deadlock and customers who are really grumpy that they don't see their DNS changes going into effect.
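
The usual shape of that kind of serialized apply (purely illustrative, not Route 53's actual internals) is a monotonic sequence number plus a conditional write, so a stale writer fails instead of silently replacing newer state. For example, with a DynamoDB conditional put (table and attribute names are made up):

    import boto3
    from botocore.exceptions import ClientError

    ddb = boto3.client("dynamodb")

    def apply_plan(table, zone, seq, rendered_plan):
        """Write the plan only if nothing with an equal or newer sequence is already stored."""
        try:
            ddb.put_item(
                TableName=table,  # hypothetical table holding the current plan per zone
                Item={
                    "zone": {"S": zone},
                    "seq": {"N": str(seq)},
                    "plan": {"S": rendered_plan},
                },
                # Reject the write if an equal-or-newer sequence already landed.
                ConditionExpression="attribute_not_exists(seq) OR seq < :new",
                ExpressionAttributeValues={":new": {"N": str(seq)}},
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # a newer plan is already in place; drop ours instead of clobbering it
            raise

The delayed Enactor in the scenario above would have had its write rejected here rather than overwriting the newer plan, at the cost of exactly the coordination work and deadlock risk you describe.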

Not for nothing, there’s a support group for those of us who’ve been hurt by WHU sev2s…

Man, I always hated that phrasing; I always tried to get people to use more precise terms like “customer change propagation.” But yeah, who hasn't been punished by a query plan change or some random connectivity problem in Southeast Asia!
