Aside from not dogfooding, what would have reduced the impact? Because "don't have a bug" is ... well, that's the difference between desire and reality.

Not dogfooding is the software and hardware equivalent of the electricity network "black start" - you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine spinning until the steam takes up load and the real generator is able to get volts onto the wire.

Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity-driven and there's less to go wrong, but if we're in 'line up the holes in the Swiss cheese' territory, you can always have 'for want of a nail' issues with any mechanism. The Honda generator can have a hole in the petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.

We can draw inspiration from older DNS infrastructure like the root servers. They use a list of names rather than a single name. Imagine if the root (".") were a single nameserver distributed with anycast, and how a single misconfiguration would bring down the whole internet. Instead we have a list of name servers, operated by different entities, and the only thing that should happen if one goes down is that the next one gets used after a timeout.
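A rough sketch of that "try the next one after a timeout" behavior (the server IPs are placeholder examples, and build_query is assumed to produce a raw DNS query packet):

    import socket

    # Placeholder server list; a real stub resolver would carry the full set of
    # root servers (or whatever redundant endpoints apply).
    SERVERS = ["198.41.0.4", "199.9.14.201", "192.33.4.12"]

    def query_first_available(build_query, timeout=2.0):
        last_err = None
        for ip in SERVERS:
            try:
                with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
                    s.settimeout(timeout)
                    s.sendto(build_query(), (ip, 53))
                    data, _ = s.recvfrom(512)
                    return data   # first healthy server wins
            except OSError as err:
                last_err = err    # timeout or network error: fall through to the next server
        raise last_err

No single operator's misconfiguration takes out resolution as long as at least one entry in the list still answers.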

The article brings up a fairly important point about reducing the impact of bugs. Critical systems need sanity checks for states and values that should never occur during normal operation, with some corresponding action in case they do. Endpoints could have had sanity checks for invalid DNS, such as zero IP addresses or broken records, and either reverted to the last valid state or fallen back to a predefined emergency system. Either would have reduced the impact.
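A minimal sketch of that kind of check, with a hypothetical apply_dns_plan step (not AWS's actual code):

    # Refuse to publish a state that should never occur (zero records, records
    # without addresses); revert to the last known-good plan instead.
    def apply_dns_plan(new_records, last_known_good, publish):
        valid = bool(new_records) and all(r.get("ip") for r in new_records)
        if not valid:
            publish(last_known_good)   # fall back to a valid state rather than go dark
            raise ValueError("refusing to publish an empty or broken record set")
        publish(new_records)
        return new_records             # becomes the next last-known-good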

A cap on region size could have helped. Region isolation didn't fail here, so splitting us-east-1 into 3, 4, or 5 regions would have meant a smaller impact.

Having such a gobsmackingly massive singular region seems to be working against AWS.

DynamoDB is working on going cellular, which should help. Some parts are already cellular, and others, like DNS, are in progress. https://docs.aws.amazon.com/wellarchitected/latest/reducing-...

us-east-2 already exists and wasn’t impacted. And the choice of where to deploy is yours!

Which is great, except for global services whose deployment location you don't control. Those ended up being in us-east-1, resulting in issues no matter where your EC2 instances happened to be.

Like what?

AWS IAM, AWS Organizations, and Amazon Route 53 (DNS); AWS services that rely on Route 53, including ELB and API Gateway; Amazon S3 bucket creation and some other calls; sts.amazonaws.com, which is still us-east-1 in many cases; and Amazon CloudFront, AWS WAF (for CloudFront), and AWS Shield Advanced, which all have us-east-1 as their control plane.

To be clear, the above list describes control-plane dependencies on us-east-1. During the incident the service itself may have been fine but could not be (re)configured.

The big one is really Route 53 though. DNS having issues caused a lot of downstream effects since it's used by ~everything to talk to everything else.

Other services include Slack. No, it's not an AWS service, but for companies reliant on it or something like it, it doesn't matter how much you've avoided EC2 if an outside service you rely on goes down. Full list: https://www.reddit.com/r/DigitalMarketing/comments/1oc2jtd/a...

There's already some virtualization going on. (I heard that what some people see as us-east-1a might be us-east-1c for others, to spread the load. Though obviously it's still too big.)

They used to do that (so that everyone picking 1a wouldn’t actually route traffic to the same AZ), but at some point they made it so all new accounts get the same AZ labels, to stop confusing people.

The services that fail on AWS’s side usually fail across multiple zones anyway, so maybe it wouldn’t help.

I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane, but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway; a meteor could just as easily strike us-secret-1).

This is not exactly true. The AZ names are indeed randomized per account, and that is the identifier you see everywhere in the APIs. The difference now is that they also expose a mapping from AZ name (randomized) to AZ ID (not randomized), so that you can know that AZ A in one account is actually the same datacenter as AZ B in a different account. This becomes quite relevant when you have systems spread across accounts but want the communication to stay zonal.

You're both partially right. Some regions have a random AZ mapping; all regions launched since 2012 have a static AZ mapping. https://docs.aws.amazon.com/global-infrastructure/latest/reg...

Oh wow, thanks for telling me this. I didn't know this differed between regions. I just checked some of my accounts, and indeed the mapping is stable between accounts for Frankfurt, for example, but not for Sydney.

> but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator ...

Tangent, but players of Satisfactory might recognize this condition. If your main power infrastructure somehow goes down, you might not have enough power to start the pumps/downstream machines to power up your main power generators. Thus it's common to have some Tier 0 generators stashed away somewhere to kick start the rest of the system (at least before giant building-sized batteries were introduced a few updates ago).

(Almost) irrelevant question. You wrote "...a bigger generator, which you spin up to start the turbine spinning until the steam takes up load..." I once served on a steam-powered US Navy guided missile destroyer. In addition to the main engines, we had four steam turbine electrical generators. There was no need--indeed, no mechanism--for spinning any of these turbines electrically; they all started up simply by sending them steam. (To be sure, you'd better ensure that the lube oil was flowing.)

Are you saying it's different on land-based steam power plants? Why?

Most (maybe all?) large grid-scale generators use electromagnets to produce the magnetic field they need to generate electricity. Those magnets require electricity to create that field, so you need a small generator to kickstart your big generator's magnets in order to start producing power. There are other concerns too; depending on the nature of the plant, there may be other machinery that requires electricity before the plant can operate. It doesn't take much startup energy to open the gate on a hydroelectric dam, but I don't think anyone is shoveling enough coal to cold-start a coal plant without a conveyor belt.

If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.

Steam turbine generators (this was back in the 1970s, the ship was decommissioned in 2003). No motor to start turning the turbines, just apply steam. The main engines were also steam turbines, and likewise started just by opening the throttle valves--one for forward, another for backing. We did have a jacking gear powered by an electric motor, but it was only used to prevent warping as the engine cooled down when we went cold iron. And of course the boilers ran on fuel oil, not coal--coal went out on naval ships in the early 20th century.

As for how the generators' fields were started, now that you mention it I'm not sure. We did have emergency diesel generators (and of course shore power when we were pier-side), so maybe those supplied the electricity to jump-start the generators. But they were 750 kW generators (upgraded in 1974 from 500 kW), so I don't imagine batteries would have sufficed.

The "why" would involve different design considerations that a warship that may get shot at with enemy missiles, and the cost incentives for building one, and also because naval destroyers typically aren't connected to the grid, so black starts are much more of a possibility and need to recover from them is often under duress, vs a land-based power station simply isn't going to have those same issues.

Yeap, and get shot at we did--but by artillery, not missiles. The North Vietnamese were rumored to have Russian Styx missile boats, but if they did they never left port.

The post argues for "control theory" and slowing down changes. (Which... well, sure, maybe, but it will slow down convergence, or it complicates things if some classes of actions are faster than others.)

But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of the API responses, so if shit's going down, the downstream ones become more patient and slow down. (And if things are getting cleaned up, the downstream services can speed back up.)
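A minimal sketch of that feedback loop, assuming a hypothetical "X-Load" response header (0.0 idle to 1.0 saturated) that the upstream derives from queue depth or PSI:

    import time
    import requests

    def paced_calls(url, base_delay=0.05, max_delay=5.0):
        delay = base_delay
        while True:
            resp = requests.get(url, timeout=10)
            load = float(resp.headers.get("X-Load", "0"))
            if load > 0.8:
                delay = min(delay * 2, max_delay)   # upstream is struggling: be more patient
            else:
                delay = max(delay / 2, base_delay)  # upstream is recovering: speed back up
            yield resp
            time.sleep(delay)

The downstream caller iterates over paced_calls(...) and its request rate tracks the upstream's own load signal instead of a guess.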

I’ve been thinking about this problem for decades. Load feedback is a wonderful idea, but extremely difficult to put into practice. Every service has a different and unique architecture, and even within a single service, different requests can consume significantly different resources. This makes it difficult, if not impossible, to provide a single quantitative number in response to the question “what request rate can I send you?” It also requires tight coupling between the load balancer and the backends, which has problems of its own.

I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.

thanks for the YT paper!

my point is that there's no need to try (and fail) to define some universal backpressure semantics between coupling points; after all, this can be done locally, and even after the fact (every time there's an outage, or better yet, every time there's a "near miss") the signal to listen to will show up.

and if not, then not, which means (as you said) that link likely doesn't have this kind of simple semantics. maybe because the nature of the integration is not request-response or not otherwise structured to provide this apparent legibility, even if it's causally important for downstream.

simply thinking about this during post-mortems, having metrics available (which is anyway a given in these complex high-availability systems), having the option in the SDK, seems like the way forward

(yes, I know this is basically the circuit breaker and other Netflix-evangelized ideas with extra steps :))

The simplest and most effective strategy we know of today for automatic recovery, one that gives the impacted service the ability to avoid entering a metastable state, is for clients to implement retries with exponential backoff. No circuit-breaker-type functionality is required. Unfortunately, it requires that clients be well behaved.
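A minimal sketch of such a well-behaved client, with capped exponential backoff and full jitter (the retryable error type and the call being retried are placeholders):

    import random
    import time

    class TransientError(Exception):
        """Placeholder for whatever the client treats as retryable (throttles, timeouts)."""

    def call_with_backoff(do_request, max_attempts=5, base=0.1, cap=20.0):
        for attempt in range(max_attempts):
            try:
                return do_request()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise
                # Sleep between 0 and min(cap, base * 2^attempt): exponential growth,
                # capped, with full jitter so clients don't retry in lockstep.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))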

Also, circuit breakers have issues of their own:

“Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted.” https://aws.amazon.com/builders-library/timeouts-retries-and...

Consider a situation in which all the clients have circuit breakers. All of them enter the open state once the trigger condition is met, which drops request load on the service to zero. Your autoscaler reduces capacity to the minimum level in response. Then all the circuit breakers reset to the closed state. Your service then experiences a sudden rush of normal- or above-normal traffic, causing it to immediately exhaust available capacity. It’s a special case of bimodal behavior, which we try to avoid as a matter of sound operational practice.
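For completeness, a minimal sketch of the local retry token bucket the quoted passage describes, with illustrative capacity and refill numbers:

    import time

    class RetryTokenBucket:
        def __init__(self, capacity=10, refill_per_second=0.5):
            self.capacity = capacity
            self.tokens = float(capacity)
            self.refill_per_second = refill_per_second
            self.last_refill = time.monotonic()

        def allow_retry(self):
            # Refill based on elapsed time, then spend a token if one is available.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_per_second)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True    # retry allowed
            return False       # retry budget exhausted: fail fast instead of piling on

Retries flow freely while tokens remain and drop to the fixed refill rate once the bucket drains, which avoids both the retry storm and the all-or-nothing modality of an open circuit breaker.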

Thundering herd is a known post-outage, service-restore failure mode. You never let your load balancer and the API boxes behind it dip below some statically defined low-water mark; the waste of money is better than going down when the herd shows up. When it does show up, as noted, a token-bucket rate limiter returns 429s while the herd slowly gets let back onto the system. Even if the app servers can take it, it's not a given that the queue or database systems behind them can absorb the herd as well (especially if that's what went down in the first place).

I think a lot of problems across different systems have a similar shape. You have a system that needs some autonomy (like an aeroplane in flight). It has sources of authority (say, a sensor or ATC) that are sometimes unavailable, delayed, or give wrong data. When that happens, we are unwilling to fall back on more autonomy and automation. But there is limited scope for human intervention, due to the scale of the problem or just technical difficulty. We reach an inflection point where the only direction left is to give up some element of human control: accept that systems will sometimes receive bad data and need some autonomy to ignore it when it is contraindicated, and that higher-level control is just another source of possible false data.