A cap on region size could have helped. Region isolation didn't fail here, so splitting us-east-1 into 3, 4, or 5 smaller regions would have meant a smaller impact.
Having such a gobstoppingly massive singular region seems to be working against AWS
DynamoDB is working on going cellular, which should help. Some parts are already cellular, and others, like DNS, are in progress. https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
us-east-2 already exists and wasn’t impacted. And the choice of where to deploy is yours!
Which is great, except for the global services whose deployment location you don't control. Those ended up in us-east-1, so they had issues no matter where your EC2 instances happened to be.
Like what?
- AWS IAM and AWS Organizations
- Amazon Route 53 (DNS), and AWS services that rely on Route 53 such as ELB and API Gateway
- Amazon S3 bucket creation and some other calls
- sts.amazonaws.com, which is still us-east-1 in many cases
- Amazon CloudFront, AWS WAF (for CloudFront), and AWS Shield Advanced, all of which have us-east-1 as their control plane
To be clear, the above list reflects control plane dependencies on us-east-1. During the incident the service itself may have been fine, but it could not be (re)configured.
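For STS specifically, you can opt into the regional endpoint instead of the global one. A minimal sketch with boto3 (assumes you have credentials configured; the explicit endpoint_url is just to make the difference visible, and I believe newer SDK versions already default to regional STS endpoints):

    # Sketch: call a regional STS endpoint instead of the global one,
    # so a us-east-1 control plane issue doesn't affect token vending
    # for workloads running in other regions.
    import boto3

    # The global endpoint (sts.amazonaws.com) is backed by us-east-1.
    global_sts = boto3.client("sts", endpoint_url="https://sts.amazonaws.com")

    # The regional endpoint stays within us-east-2.
    regional_sts = boto3.client(
        "sts",
        region_name="us-east-2",
        endpoint_url="https://sts.us-east-2.amazonaws.com",
    )

    print(regional_sts.get_caller_identity()["Arn"])

This only helps for the data-plane style calls like GetCallerIdentity/AssumeRole; it doesn't change where IAM itself is administered.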
The big one is really Route 53 though. DNS having issues caused a lot of downstream effects since it's used by ~everything to talk to everything else.
Other affected services include Slack. No, it's not an AWS service, but for companies reliant on it (or something like it), it doesn't matter that you're not on EC2 if an outside service you rely on goes down. Full list: https://www.reddit.com/r/DigitalMarketing/comments/1oc2jtd/a...
There's already some virtualization going on. (I heard that what one account sees as us-east-1a might be us-east-1c for another, to spread the load. Though obviously the region is still too big.)
They used to do that (so that everyone picking 1a wouldn't actually route traffic to the same AZ), but at some point they made it so all new accounts had the same AZ labels to stop confusing people.
The failures on AWS's side usually span multiple zones anyway, so maybe it wouldn't help.
I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway, a meteor could just as easily strike us-secret-1).
This is not exactly true. The AZ names are indeed randomized per account, and that is the identifier you see everywhere in the APIs. The difference now is that they also expose a mapping from AZ name (randomized) to AZ ID (not randomized), so you can know that AZ A in one account is actually the same datacenter as AZ B in a different account. This becomes quite relevant when you have systems spread across accounts but want the communication to stay zonal.
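If you want to see the mapping for your own account, here's a quick sketch with boto3 (assuming configured credentials; pick whichever region you care about):

    # Sketch: print the AZ name -> AZ ID mapping for one region.
    # Run it against two different accounts to see whether "us-east-1a"
    # in one maps to the same zone ID as in the other.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])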
You're both partially right. Some regions have a randomized AZ name mapping per account; all regions launched since 2012 have a static mapping. https://docs.aws.amazon.com/global-infrastructure/latest/reg...
Oh wow. Thanks for telling me this. I didn't know that this was different for different regions. I just checked some of my accounts, and indeed the mapping is stable between accounts for Frankfurt, for example, but not for Sydney.