These experiences of power outages are weird to me. What I consider "typical" data center design should make it really hard to lose power.
"Typical" design would be: Each cabinet fed by 2 ATS (transfer switch). Each ATS fed by two UPS (battery bank). Each UPS fed by utility with generator backup. The two ATS can share one UPS/generator, so each cabinet would be fed by 3 UPS+generator. A generator failing to start shouldn't be a huge deal, your cabinet should still have 2 others.
The data center I'm currently in did have a power event ~3 years ago, I forget the exact details but Mistakes Were Made (tm). There were several mistakes that led to it, including that one of the ATS had been in "maintenance mode", because they were having problems getting replacement parts, but then something else happened as well. In short, they had gotten behind on maintenance and no longer had N+1 redundancy.
On top of that, their communication was bad. It was snowing cats and dogs, we suddenly lose all services at that facility (an hour away), and I call and their NOC will only tell me "We will investigate it." Not a "We are investigating multiple service outages", just a "we will get back to you." I'm trying to decide if I need to drive multiple hours in heavy snow to be on site, and they're playing coy...
You summed up quite well how these things happen.
All of these parts make for an increasingly complex system with a large number of failure points.
Our DC was a very living entity -- servers were being changed out and rack configurations altered very regularly. Large operations were carefully planned. You wouldn't overlook the power requirements of a few racks being added -- there'd -- literally[0] -- be no place to plug them in without an electrician being brought in. However, every once in a while, over a 3-month period, a couple of racks would have old devices replaced one at a time, either due to failure or a refresh.
Since they weren't plugged directly into rack batteries (we had two battery rooms providing DC-wide battery backup), there was no per-rack UPS to trip on overload. Since we were still below the capacity of the circuit, the breaker(s) wouldn't trip. And maybe we're still under capacity on paper for our backup system, but a few of the batteries are under-performing.
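To make that concrete, here's a toy sketch (every number in it is invented) of how a slow device-by-device refresh can stay comfortably under the breaker rating while quietly passing what the degraded batteries can actually carry:

    # Invented numbers: a shared battery plant rated 90 kW, a few strings
    # under-performing, and a 100 kW breaker that never comes close to tripping.
    breaker_kw = 100.0
    battery_nameplate_kw = 90.0
    battery_derate = 0.80                # under-performing strings
    battery_effective_kw = battery_nameplate_kw * battery_derate

    load_kw = 66.0                       # starting load on the circuit
    for swap in range(1, 7):             # six devices refreshed, one at a time
        load_kw += 4.0                   # each new box draws a bit more than the old one
        print(f"swap {swap}: {load_kw:.0f} kW, "
              f"breaker ok: {load_kw < breaker_kw}, "
              f"battery ok: {load_kw < battery_effective_kw}")

Nothing trips, nothing alarms, and the only way you find out is a real power cut.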
I think the lesson we learned when this happened was: you need to "actually test" the thing. My understanding is that our tests were of the individual components in isolation. We'd load test the batteries and the generator and then the relays between them. At the end of the day, though, if you don't cut the power and see what happens, you don't truly know. And my understanding is that having that final step in place led to a large number of additional tests "of the individual components" being devised, which ensured they never had an outage like that again.
[0] Guessing it's common practice to make "finding a f!cking power outlet" nearly impossible in a DC. Every rack had exactly the number of leads its hardware needed, so every receptacle was completely full. They rolled around a cart with a monitor, printer, label printer, keyboard, mouse and a huge UPS on it so staff could do daily maintenance work.
We had an outage a few years ago on Black Friday; we had been getting our DC to purchase and rack servers for us for years, and the data centre had, IIRC, four separate circuits that our servers were spread across depending on which rack they were in. Unfortunately, we hadn't given them any input on which servers served which purpose, and we occasionally repurposed hardware from one service to another.
This resulted in one of our microservices existing entirely on one circuit, along with significant swaths of other services. We also overprovisioned our hardware so that we could cope with huge increases in traffic, e.g. Black Friday/Cyber Monday weekend. Generally a good idea, but since our DC obviously didn't have any visibility into our utilization, they didn't realize that, if our servers suddenly spiked to 100% CPU use, it could triple our power usage.
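Back-of-the-envelope version of that power math (all numbers here are invented, just to show the shape of it):

    # Invented numbers: an over-provisioned circuit that is fine at idle
    # but trips its breaker if every server pegs the CPU at once.
    servers_on_circuit = 30
    idle_watts = 150                    # mostly idle thanks to over-provisioning
    peak_watts = 450                    # roughly 3x draw at 100% CPU
    circuit_volts = 208
    breaker_amps = 60
    planned_amps = breaker_amps * 0.8   # the usual "plan to 80% of the rating"

    for label, watts in (("idle", idle_watts), ("peak", peak_watts)):
        amps = servers_on_circuit * watts / circuit_volts
        verdict = "holds" if amps <= breaker_amps else "trips"
        print(f"{label}: {amps:.0f} A drawn vs {planned_amps:.0f} A planned -- breaker {verdict}")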
Easy to see where this is going, I'm sure.
The microservice which existed entirely on one circuit was one of the most important, and was hit constantly - we were a mobile game company, and this service kept track of players' inventories, etc. Not something you want to hit the database for, so we had layers of caching in Redis and memcached, all of which lived on the application servers themselves, all of them clustered so that we could withstand several of our servers going offline. This meant that when we got a massive influx of players all logging in to take advantage of those Black Friday deals, the things hit hardest were this service and its associated Redis and memcached clusters, as well as (to a lesser extent) the primary database and the replication nodes - some of which were also on the same circuit.
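The exact caching setup isn't spelled out here, but the shape of it is the usual cache-aside pattern; a minimal sketch, assuming redis-py and made-up key names/TTLs:

    # Minimal cache-aside sketch (redis-py assumed; keys, TTL, and the DB call
    # are placeholders, not the actual implementation).
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def fetch_inventory_from_db(player_id):
        # Stand-in for the expensive database query we're trying to avoid.
        return {"player_id": player_id, "items": []}

    def get_inventory(player_id):
        key = f"inventory:{player_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)              # hot path: never touches the DB
        data = fetch_inventory_from_db(player_id)
        r.set(key, json.dumps(data), ex=300)       # repopulate with a short TTL
        return data

Which is great for the database, right up until the servers hosting those caches all share a breaker.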
So as we're all trying to tune the systems live to optimize for this large influx of traffic, it trips the breaker on that circuit and something like 1/3 of our servers go offline. We call the CEO of the DC company, he has to call around to figure out what the heck just happened, and it takes a while to work out why. Someone has to go into the DC to flip the breaker (once they know it's not just going to trip again), which is a several-hour drive from Vancouver to Seattle.
Meanwhile, we all have to frantically promote replica databases, re-deploy services to other application servers, and basically try to keep our entire system up and online, for the largest amount of traffic we've ever had, on 60% of the server capacity we'd planned on.
I was working on that problem (not just awake, but specifically working on that issue) for 23 hours straight after the power went out. Our CEO made a list of every server we had and how we wanted to balance them across the circuits, and then the DC CEO and I spent all night powering off servers one by one, physically moving them to different cabs, bringing them online, rinse repeat.
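The rebalancing itself is basically a bin-packing exercise; a rough sketch of the greedy version (server names and wattages are invented), placing the biggest draws first on whichever circuit currently has the least load:

    # Greedy rebalancing sketch: biggest power draws first, each onto the
    # currently least-loaded circuit (all names and wattages invented).
    servers = {"db-01": 600, "db-02": 600, "app-01": 450, "app-02": 450,
               "cache-01": 300, "cache-02": 300}
    circuits = {"circuit-1": 0, "circuit-2": 0, "circuit-3": 0, "circuit-4": 0}
    placement = {}

    for name, watts in sorted(servers.items(), key=lambda kv: -kv[1]):
        target = min(circuits, key=circuits.get)    # least-loaded circuit
        circuits[target] += watts
        placement[name] = target

    print(placement)
    print(circuits)

In practice you'd also want an anti-affinity rule so all the replicas of one service can't land on a single circuit -- which was the original failure mode here.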
TL;DR electricity is complicated.
Thanks for sharing, that was a very entertaining read and gave me fond memories of the decade or so I worked in a phone switch.
My office was in a space that was shared by a large legacy phone switch, SONET node, and part of our regional data center but I worked in infrastructure doing software development. My proximity[0] meant I ended up being used to support larger infrastructure efforts at times, but it usually just meant I got a lot of good ... stories.
I wonder if there's a collection of Data Center centric "Daily WTF" stories or something similar.
For me, I think my favorite is when we had multiple unexplained power failures very late at night in our test/management DC[1]. It turned out "the big red kill switch" button behind the plexiglass thing designed to make sure someone doesn't accidentally "lean into it and shut everything off" was mistaken for the "master light switch" by the late night cleaning crew. Nobody thought about "the cleaning crew" because none of the other DCs allowed cleaning crew anywhere near them, but this was a test switch (someone forgot about that other little detail). If memory serves, it took a few outages before they figured it out. The facilities manager actually hung around one night trying to witness it, only to have the problem not happen (because the cleaning lady didn't turn the lights off when people were there, duh!). I'd like to say that it was almost a "maybe bugs/animals/ghosts are doing it" impulse that caused them to check the cameras, but it was more likely the pattern being recognized as "days coinciding with the times the late night cleaning crew does their work."
Outside of that, there was the guy who made off with something like 4 of these legacy switch cards because some fool put a door stop on the door while moving some equipment in. He was probably really excited when he found out they were valued at "more than a car", but really disappointed when the eBay listing for something like $20,000 (which was, I wanna say, at least a 50% markdown) was quickly noticed by the supplier[2], he was arrested, and we were left awaiting the return of our hardware.
[0] Among many other things, due to a diverse 17-year career there, but mostly just because I was cooperative/generally A-OK with doing things that "were far from my job" when I could help out my broader organization.
[1] When that went down, we couldn't connect to the management interfaces of any of the devices "in the network". It's bad.
[2] Alarms went off somewhere -- these guys know if you are using their crap, you're stuck with their crap and they really want you stuck paying them for their crap. I'm fairly certain the devices we used wouldn't even function outside of our network but I don't remember the specifics. AFAIK, there's no "pre-owned/liquidation-related market" except for stripping for parts/metals. When these things show up in unofficial channels, they're almost certainly hot.
While what you're describing is definitely possible, datacenter architecture is becoming less and less bulletproof-reliable in service of efficiency (both cost and PUE).
>> These experiences of power outages are weird to me. What I consider "typical" data center design should make it really hard to lose power.
At least 30% of the datacenter outages we had at a large company were due to power-related issues.
Just a simple small-scale one: a technician accidentally plugged the redundant circuits into the same source power link. When we lost a phase, it took down 2/3 of the capacity instead of 1/3. Oops.
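A toy version of that arithmetic (the layout is invented, but it shows how the mis-wiring changes the blast radius):

    # Toy model: three circuits, each carrying a third of the capacity, meant
    # to hang off three different source phases. Circuit-2 got plugged into
    # the same source as circuit-1 (layout invented for illustration).
    intended = {"circuit-1": "phase-A", "circuit-2": "phase-B", "circuit-3": "phase-C"}
    as_built = {"circuit-1": "phase-A", "circuit-2": "phase-A", "circuit-3": "phase-C"}

    def capacity_lost(wiring, failed_phase):
        down = [c for c, phase in wiring.items() if phase == failed_phase]
        return len(down) / len(wiring)

    print("intended:", capacity_lost(intended, "phase-A"))   # 1/3
    print("as built:", capacity_lost(as_built, "phase-A"))   # 2/3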