This is public messaging to explain the problem at large. This isn't really a post-incident analysis.
Before the active incident is “resolved” there's an evaluation of probable/plausible reoccurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well, to react quickly to any reoccurrence. Any likely open risks are actively worked on and mitigated before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.
Next, any plausible paths to “risk of reoccurrence” would be the top dev team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue depth problems, or other similar “near miss” findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.
Then, in the next few weeks, at the org- and AWS-level “ops meetings”, there are going to be in-depth discussions of the incident, response, underlying problems, etc. The goal there is organizational learning and broader dissemination of lessons learned, action items, best practices, etc.