I was kinda surprised by the lack of CAS on the per-endpoint plan version, or of patterns like rejecting stale writes via 2PC or a single-writer lease per endpoint.
Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:
See https://news.ycombinator.com/item?id=45681136. The actual DNS mutation API does, effectively, CAS. They had multiple unsynchronized writers who raced, with no logical constraints or ordering on the changes. Without much thought they _might_ have been able to implement something like a vector, either by updating the zone serial or through another "sentinel record" that was always included in any ChangeRRSets affecting that label/zone, like a TXT record containing a serialized change set number or a "checksum" of the old + new state.
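A rough sketch of the sentinel-record idea, in case it helps. It only builds a change batch in the shape that boto3's `change_resource_record_sets` accepts and does not call AWS. The `_plan.` label, the function name, and the plan-number encoding are all made up for illustration, and it leans on my understanding that a batch is applied all-or-nothing and that a DELETE must match the existing record's values:

```python
def sentinel_change_batch(name, new_ips, expected_plan, new_plan, ttl=60):
    """Build a ChangeBatch that swaps A records only if the sentinel TXT
    record still holds expected_plan (hypothetical scheme, not AWS's)."""
    if new_plan <= expected_plan:
        raise ValueError("refusing to apply an older or equal plan")

    sentinel = f"_plan.{name}"  # hypothetical sentinel label

    def rrset(rname, rtype, values):
        return {
            "Name": rname,
            "Type": rtype,
            "TTL": ttl,
            "ResourceRecords": [{"Value": v} for v in values],
        }

    return {
        "Changes": [
            # Assumed to fail if another writer already moved the sentinel,
            # which would reject the whole batch.
            {"Action": "DELETE",
             "ResourceRecordSet": rrset(sentinel, "TXT", [f'"{expected_plan}"'])},
            {"Action": "CREATE",
             "ResourceRecordSet": rrset(sentinel, "TXT", [f'"{new_plan}"'])},
            # The change we actually care about.
            {"Action": "UPSERT",
             "ResourceRecordSet": rrset(name, "A", new_ips)},
        ]
    }


batch = sentinel_change_batch("api.example.com.", ["192.0.2.10"], 41, 42)
```

The writer would read the sentinel first, then pass what it saw as `expected_plan`; a racing writer that got there first makes the DELETE mismatch, so the stale batch never lands.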
I'm guessing the "plans" aspect skipped that and they were just applying intended state, without trying to serialize the writes. And last-write-wins, until it doesn't.
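To make the difference from last-write-wins concrete, here is a toy in-memory sketch of a version-gated write: a plan is applied only if its version is newer than what the endpoint already has. Every name here (`EndpointStore`, `apply_plan`, `StalePlanError`) is invented, and this is a monotonic version check rather than whatever AWS actually runs:

```python
import threading


class StalePlanError(Exception):
    """Raised when a writer tries to apply a plan older than the current one."""


class EndpointStore:
    """Toy store: endpoint -> (plan_version, records). Illustrative only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._state = {}

    def apply_plan(self, endpoint, plan_version, records):
        # The check and the write happen under one lock, so two racing
        # writers cannot both pass the version check.
        with self._lock:
            current_version, _ = self._state.get(endpoint, (0, ()))
            if plan_version <= current_version:
                raise StalePlanError(
                    f"{endpoint}: plan {plan_version} <= current {current_version}"
                )
            self._state[endpoint] = (plan_version, tuple(records))

    def get(self, endpoint):
        with self._lock:
            return self._state.get(endpoint)


store = EndpointStore()
store.apply_plan("regional-endpoint", 7, ["192.0.2.10"])
try:
    # A delayed writer shows up late with an older plan.
    store.apply_plan("regional-endpoint", 5, ["192.0.2.99"])
    stale_rejected = False
except StalePlanError:
    stale_rejected = True
```

With plain last-write-wins the delayed writer would have overwritten plan 7 with plan 5; here it gets an error and the newer state stays in place.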
Oh, I can see it from here. AWS internally has a problem with things like task orchestration. I bet that the enactor can be rewritten as a goroutine/thread in the planner, with proper locking and ordering.
But that's too complicated and results in more code. So they likely just used an SQS queue with consumers reading from it.