Someone else posted about PDX02 going down entirely[0], so it sounds like this is the root cause, especially given the latest status update.
> Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.
> [0]: Looks like they lost utility, switched to generator, and then generator failed (not clear on scope of Gen failure yet). Some utility power is back, so recovery is in progress for some portion of the site.
[0]: https://puck.nether.net/pipermail/outages/2023-November/0149...
I think every datacenter I've ever worked with, across ~4 jobs, has had an incident report like "generator failed right as we had an outage."
Am I unlucky, or is there something I'm missing about datacenter administration that makes it really hard to maintain a generator? I guess you don't hear about the times the generator worked, but it feels like a high rate of failure to me.
These experiences of power outages are weird to me. What I consider "typical" data center design should make it really hard to lose power.
"Typical" design would be: Each cabinet fed by 2 ATS (transfer switch). Each ATS fed by two UPS (battery bank). Each UPS fed by utility with generator backup. The two ATS can share one UPS/generator, so each cabinet would be fed by 3 UPS+generator. A generator failing to start shouldn't be a huge deal, your cabinet should still have 2 others.
The data center I'm currently in did have a power event ~3 years ago, I forget the exact details but Mistakes Were Made (tm). There were several mistakes that led to it, including that one of the ATS had been in "maintenance mode", because they were having problems getting replacement parts, but then something else happened as well. In short, they had gotten behind on maintenance and no longer had N+1 redundancy.
On top of that, their communication was bad. It was snowing cats and dogs, we suddenly lose all services at that facility (an hour away), and I call and their NOC will only tell me "We will investigate it." Not a "We are investigating multiple service outages", just a "we will get back to you." I'm trying to decide if I need to drive multiple hours in heavy snow to be on site, and they're playing coy...
You summed up quite well how these things happen.
All of these parts make for an increasingly complex system with a large number of failure points.
Our DC was a very living entity -- servers were being changed out and rack configurations altered very regularly. Large operations were carefully planned. You wouldn't overlook the power requirements of a few racks being added -- there'd -- literally[0] -- be no place to plug them in without an electrician being brought in. However, every once in a while, over a 3-month period, two racks would have old devices replaced one at a time, either due to failure or refresh.
Since they weren't plugged directly into rack batteries (we had two battery rooms providing DC-wide battery backup), the overload wouldn't trip. Since we were still below the capacity of the circuit, the breaker(s) wouldn't trip. And maybe we're still under capacity for our backup system, but a few of the batteries are under-performing.
I think the lesson we learned when this happened was: you need to "actually test" the thing. My understanding is that our tests were of the individual components in isolation. We'd load test the batteries and the generator and then the relays between. At the end of the day, though, if you don't cut the power and see what happens you don't truly know. And my understanding is that having that final step in place resulted in a large number of additional tests being devised "of the individual components" that ensured they never had an outage like that, again.
[0] Guessing it's common practice to make "finding a f!cking power outlet" nearly impossible in DC. Every rack had exactly the number of leads it needed for the hardware plugged into a completely full receptacle. They rolled around a cart with a monitor, printer, label printer, keyboard, mouse and a huge UPS on it so staff could do daily maintenance work.
We had an outage a few years ago on Black Friday; we had been getting our DC to purchase and rack servers for us for years, and the data centre had IIRC four separate circuits that our servers were on depending on which rack they were in. Unfortunately, we hadn't given them any input on which servers served which purpose, and we occasionally repurposed hardware from one service to another.
This resulted in one of our microservices existing entirely on one circuit, along with significant swaths of other services. We also overprovisioned our hardware so that we could cope with huge increases in traffic, e.g. Black Friday/Cyber Monday weekend. Generally a good idea, but since our DC obviously didn't have any visibility into our utilization, they didn't realize that, if our servers suddenly spiked to 100% CPU use, it could triple our power usage.
Easy to see where this is going, I'm sure.
The microservice which existed entirely on one circuit was one of the most important, and was hit constantly - we were a mobile game company, and this service kept track of players' inventories, etc. Not something you want to hit the database for, so we had layers of caching in Redis and memcached, all of which lived on the application servers themselves, all of them clustered so that we could withstand several of our servers going offline. This meant that when we got a massive influx of players all logging in to take advantage of those Black Friday deals, the service hit probably the hardest was this service, and its associated redis and memcached clusters, as well as (to a lesser extent) the primary database and the replication nodes - some of which were also on the same circuit.
So as we're all trying to tune the systems live to optimize for this large influx of traffic, it trips the breaker on that circuit and something like 1/3 of our servers go offline. We call the CEO of the DC company, he has to call around to find out what the heck just happened, and it takes a while to figure out what went on and why. Someone has to go into the DC to flip the breaker (once they know that it's not just going to trip again), which is a several-hour drive from Vancouver to Seattle.
Meantime, we all have to frantically promote replication databases, re-deploy services to other application servers, and basically try to keep our entire system up and online for the largest amount of traffic we've ever had on 60% of the server capacity we'd planned on.
I was working on that problem (not just awake, but specifically working on that issue) for 23 hours straight after the power went out. Our CEO made a list of every server we had and how we wanted to balance them across the circuits, and then the DC CEO and I spent all night powering off servers one by one, physically moving them to different cabs, bringing them online, rinse repeat.
TL;DR electricity is complicated.
Thanks for sharing, that was a very entertaining read and gave me fond memories of the decade or so I worked in a phone switch.
My office was in a space that was shared by a large legacy phone switch, SONET node, and part of our regional data center but I worked in infrastructure doing software development. My proximity[0] meant I ended up being used to support larger infrastructure efforts at times, but it usually just meant I got a lot of good ... stories.
I wonder if there's a collection of Data Center centric "Daily WTF" stories or something similar.
For me, I think my favorite is when we had multiple unexplained power failures very late at night in our test/management DC[1]. It turned out "the big red kill switch" button behind the plexiglass -- the thing designed to make sure someone doesn't accidentally lean into it and shut everything off -- was mistaken for the "master light switch" by the late-night cleaning crew. Nobody thought about the cleaning crew because none of the other DCs allowed a cleaning crew anywhere near them, but this was a test switch (someone forgot about that other little detail). If memory serves, it took a few outages before they figured it out. The facilities manager actually hung around one night trying to witness it, only to have the problem not happen (because the cleaning lady didn't turn the lights off when people were there, duh!). I'd like to say it was almost a "maybe bugs/animals/ghosts are doing it" impulse that made them check the cameras, but it was probably the pattern being recognized as "days coinciding with the times the late-night cleaning crew does their work."
Outside of that, there was the guy who made off with something like 4 of those legacy switch cards because some fool put a door stop on the door while moving some equipment in. He was probably really excited when he found out they were valued at "more than a car", but really disappointed when he put them on eBay for something like $20,000 (which was, I wanna say, at least a 50% markdown), was quickly noticed by the supplier[2], was arrested, and we were left awaiting the return of our hardware.
[0] Among many other things due to a diverse 17-year career, there, but mostly just because I was cooperative/generally A-OK with doing things that "were far from my job" when I could help out my broader organization.
[1] When that went down, we couldn't connect to the management interfaces of any of the devices "in the network". It's bad.
[2] Alarms went off somewhere -- these guys know if you are using their crap, you're stuck with their crap and they really want you stuck paying them for their crap. I'm fairly certain the devices we used wouldn't even function outside of our network but I don't remember the specifics. AFAIK, there's no "pre-owned/liquidation-related market" except for stripping for parts/metals. When these things show up in unofficial channels, they're almost certainly hot.
What you're describing is definitely possible, but datacenter architecture is becoming less and less bulletproof-reliable in service of efficiency (both cost and PUE).
>> These experiences of power outages are weird to me. What I consider "typical" data center design should make it really hard to lose power.
At least 30% of the datacenter outages we had at a large company were due to power-related issues.
Just a simple small-scale one: the technician accidentally plugged the redundant circuits into the same source power link. When we lost a phase, it took down 2/3 of the capacity instead of 1/3. Oops.
Even the high-profile datacenters I had to deal with in Frankfurt had the same issues. There were regular maintenance tests where they made sure the generators were working properly... I can imagine this is more of a pray-and-sweat task than anything that's in your hands. I have no clue why this is the status quo, though.
The phone utility where I live has diesel generators that kick on whenever the power goes out in order to keep the copper phone lines operational. These generators always work, or at least one of the four they have in each office does.
The datacenter I was in for a while had the big gens, with a similar "phone utility" setup - they would cut over to the backup gens once a month and run for longer than the UPS could hold the facility (if they detected an issue, they'd switch back to utility power).
They also had redundant gensets (2x the whole facility, 4x 'important stuff' - you could get a bit of a discount by being willing to be shut off in a huge emergency where gens were dying/running out of fuel).
I wonder why we don't put battery backups in each server/switch/etc. Basically, just be a laptop in each 1U rack space instead of a desktop.
Sure, you can't have much runtime, but if you got like 15 minutes for each device and it always worked, you could smooth over a lot of generator problems when something chews through the building's main grid connection.
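For scale, a rough sizing sketch with numbers I'm assuming for illustration (not vendor specs) -- it's doable per node, just not as trivial as it sounds:

```python
# Rough sizing sketch with assumed, illustrative numbers (not vendor specs):
# how much battery a per-server "laptop style" backup would actually need.

def battery_wh(load_watts: float, runtime_min: float, usable_fraction: float = 0.8) -> float:
    """Watt-hours of battery needed, derated for usable depth of discharge."""
    return load_watts * (runtime_min / 60) / usable_fraction

print(f"1U server at 400 W, 15 min: ~{battery_wh(400, 15):.0f} Wh")  # ~125 Wh
print(f"laptop at 30 W, 15 min:     ~{battery_wh(30, 15):.0f} Wh")   # ~9 Wh
# ~125 Wh per node is a couple of big laptop packs: physically fine, but across
# thousands of servers it adds real cost, weight, heat, and a battery
# maintenance / fire-safety program of its own.
```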
It’s pretty common to have a rack of batteries that might serve an aisle. The idea of these is that you’d have enough juice for the generator to kick in. You couldn’t run these for longer periods, and even if you could, you’d still have the AC unpowered, which would quickly lead to machines overheating and crashing. Plus the building access controls need powering too. As does lighting, and a whole host of other critical systems. But the AC alone is a far more significant problem than powering the racks. (I’ve worked in places where the AC has failed; it’s not fun. You’d be amazed how much heat those systems can kick out.)
In my experience, you have building UPS on one MDU and general supply on the other. Building UPS will power everything until the generators spin up, and if the UPS itself dies then you're still powered from general supply.
Did lose one building about 20 years ago when the generator didn't start
But then I assume that any services I have which are marked as three-nines or more have to be provided from multiple buildings to avoid that type of single point of failure. The services that need five-nines also take into account loss of a major city, beyond that there's significant disruption though -- especially with internet provision, as it's unclear what internet would be left in a more widespread loss of infrastructure.
One challenge is that the power usage of a server is order(s) of magnitude greater than that of a laptop. This means the cost to do what you describe is significant, hence that has to be taken into account when trying to build a cluster that is competitive...
Yeah, I agree with that. I think that power savings are a big priority for datacenters these days, so perhaps as more efficient chips go into production, the feasibility of "self-contained" servers increases. I could serve a lot of websites from my phone, and I've never had to fire up a diesel generator to have 100% uptime on it. (But, the network infrastructure uses more power than my phone itself. ONT + router is > 20W! The efficiency has to be everywhere for this to work.)
They most likely lie about the power outage. Azure does this all the time. I am so fking tired of these services. Even minor data centers in Europe have backup plans for power.
Cost of that likely is ginormous compared to their SLA obligations.
Had something similar happen at a telecom I worked at for years. We had a diesel generator and a couple of (bathroom sized) rooms full of (what looked like) car batteries. My understanding is that the two rooms were for redundancy. The batteries could power the DC for hours but were used only until the generator was ready.
The area our DC was located in was impressively reliable power-wise and -- in fact -- the backup systems had managed through the multi-state power outage in the early 2000s without a hitch (short of nearly running out of fuel due to our fuel supplier being ... just a little overwhelmed).
A few years later a two minute power outage caused the DC to go dark for a full day. Upon the power failing, the batteries kicked in and a few minutes after that the generator fired up and the DC went into holy terror.
About a minute after the generator kicked in, power to the DC blinked and ended. The emergency lights kicked in, the evacuate alarm sounded[0] and panic ensued.
My very pedestrian understanding of the problem was that a few things failed -- when the generator kicked in, something didn't switch power correctly, then something else didn't trip in response to that, and a set of 4 batteries caught fire (destroying several nearby). They were extinguished by our facilities manager with a nearby fire extinguisher. He, incidentally, was the one who pulled the alarm (which wouldn't, on its own, trigger the Halon system, I think). The remainder of the day was spent dealing with the aftermath.
We were a global multi-national telecom with a mess of procedures in place for this sort of thing. Everything was installed by electricians, to very exacting standards[1] but -- as with most things "backup" -- the way it was tested and the frequency of those tests was inadequate.
From that point forward (going on over a decade) they thoroughly tested the battery/generator backup once a quarter.
[0] We were warned to GTFO if that alarm goes off due to the flooding of chemicals that would follow a few minutes later. That didn't happen.
[1] I remember the DC manager taking over in Cleveland making his staff work weeks of overtime replacing zip ties with wax lace (and it was done NASA style). We're talking thousands and thousands of runs stretching two complete floors of a skyscraper.
I lost track of how many datacenter outages we caused testing the power backup/failover back at eBay in the mid-2000s.
There's no winning when it comes to power redundancy systems.
Lack of preventive maintenance if I were to guess. Also, these generators would need a supply of diesel fuel, and typically have a storage tank on site. If the diesel isn't used and replaced, it can gum up the generator.
I've gotten 60 year old tractors to run on 60 year old diesel. Gumming up is much more common in gas applications. I guess modern diesel might not be so robust, I know almost nothing about modern engines.
There is nothing so satisfying as when an old engine with bad gas finally catches and starts running continuously.
The generator needs to go from 0% to almost 100% output within a few seconds; the UPS battery is often only a few minutes, just long enough for the generator to stand up. There’s a reason why, when you put your hand on the cylinder heads of that big diesel, they are warm -- it's kept pre-heated so it can take load immediately. Much like the theatre: “You are only as good as your last rehearsal.”
It's not usually the battery backup or the generator that fails. It's usually the switching equipment that has to go from mains to battery to generator to battery to mains. And doing it without causing a voltage sag on the generator.
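A minimal sketch of that transfer sequence (my own illustration of what's described above, not a model of any particular ATS product) -- each hop is a separate piece of switchgear that only ever runs during tests or outages:

```python
# Minimal sketch of the transfer sequence the switching equipment has to get
# right; illustrative only.

SEQUENCE = [
    ("mains",     "battery",   "utility fails; UPS picks up the load instantly"),
    ("battery",   "generator", "gen must reach stable voltage/frequency before taking load"),
    ("generator", "battery",   "utility returns; ride the UPS during retransfer"),
    ("battery",   "mains",     "back on utility; UPS recharges"),
]

for src, dst, note in SEQUENCE:
    print(f"{src:>9} -> {dst:<9} {note}")
```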
Running a generator yard is just hard. You are acting as your own power utility with equipment that only runs during tests or outages. Running successfully at commissioning or during tests increases likelihood of service when needed, but is not a guarantee.
https://aws.amazon.com/message/67457/ (AWS: Summary of the AWS Service Event in the US East Region, July 2, 2012)
> On Friday night, as the storm progressed, several US East-1 datacenters in Availability Zones which would remain unaffected by events that evening saw utility power fluctuations. Backup systems in those datacenters responded as designed, resulting in no loss of power or customer impact. At 7:24pm PDT, a large voltage spike was experienced by the electrical switching equipment in two of the US East-1 datacenters supporting a single Availability Zone. All utility electrical switches in both datacenters initiated transfer to generator power. In one of the datacenters, the transfer completed without incident. In the other, the generators started successfully, but each generator independently failed to provide stable voltage as they were brought into service. As a result, the generators did not pick up the load and servers operated without interruption during this period on the Uninterruptable Power Supply (“UPS”) units. Shortly thereafter, utility power was restored and our datacenter personnel transferred the datacenter back to utility power. The utility power in the Region failed a second time at 7:57pm PDT. Again, all rooms of this one facility failed to successfully transfer to generator power while all of our other datacenters in the Region continued to operate without customer impact.
> The generators and electrical switching equipment in the datacenter that experienced the failure were all the same brand and all installed in late 2010 and early 2011. Prior to installation in this facility, the generators were rigorously tested by the manufacturer. At datacenter commissioning time, they again passed all load tests (approximately 8 hours of testing) without issue. On May 12th of this year, we conducted a full load test where the entire datacenter switched to and ran successfully on these same generators, and all systems operated correctly. The generators and electrical equipment in this datacenter are less than two years old, maintained by manufacturer representatives to manufacturer standards, and tested weekly. In addition, these generators operated flawlessly, once brought online Friday night, for just over 30 hours until utility power was restored to this datacenter. The equipment will be repaired, recertified by the manufacturer, and retested at full load onsite or it will be replaced entirely. In the interim, because the generators ran successfully for 30 hours after being manually brought online, we are confident they will perform properly if the load is transferred to them. Therefore, prior to completing the engineering work mentioned above, we will lengthen the amount of time the electrical switching equipment gives the generators to reach stable power before the switch board assesses whether the generators are ready to accept the full power load. Additionally, we will expand the power quality tolerances allowed when evaluating whether to switch the load to generator power. We will expand the size of the onsite 24x7 engineering staff to ensure that if there is a repeat event, the switch to generator will be completed manually (if necessary) before UPSs discharge and there is any customer impact.
Test your backups! Obviously easier said than done of course.
Experience in 'small' high availability safety-critical systems says:
1- 'failover often, failover safely'. Things that run once a month or 'just in case' are the most likely to fail.
2- people (customers) often aren't ready to pay for the cost of designing and operating systems with the availability levels they want.
Datacentre administrators don't know how to run utilities.
Imagine replacing the word "power" with "sewage" and see whether you would entrust the functionality of your toilet to your local friendly sysadmin.
No. You'd never ask a system administrator to administer your plumbing. Neither should you ask your system administrator to maintain a diesel power generator. Diesel generators have more in common with automobile internal combustion engines and, in the high-power segment, with airplane jet turbines. In fact, many turbine cores are used both as airplane jet engines and as terrestrial power generation units.
You're basically asking the wrong people to maintain the infrastructure.
When I worked in a DC the HVAC guys did the cooling. The electricians did the power and genset. We also had a local GE guy who did the engine part of the genset. These aren't sysadmin running generators. They are specialists hired for the job.
The more concerning issue here is that their control plane is based out of a single datacenter.
A multi-datacenter setup, which, based on their stack, could just be jobs running on top of a distributed key-value store (and for the uninitiated, this is effectively what Kubernetes is), could greatly alleviate such concerns.
Kubernetes' default datastore, etcd, is not tolerant of latencies between multiple regions. Generally, vanilla k8s clusters have a single-region control plane.
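As a rough illustration (the numbers are my recollection of etcd's published tuning rules of thumb -- check the etcd tuning docs before relying on them): the defaults assume LAN-class round trips, and stretching them for WAN links slows failure detection accordingly:

```python
# Sketch of etcd's tuning rule of thumb as I understand it: heartbeat interval
# around the member-to-member RTT, election timeout at least ~10x RTT.
# Illustrative only -- verify against the etcd documentation.

def suggest_etcd_timings(rtt_ms: float) -> dict:
    heartbeat = max(100, round(rtt_ms))        # etcd default: 100 ms
    election = max(1000, round(10 * rtt_ms))   # etcd default: 1000 ms
    return {"--heartbeat-interval": heartbeat, "--election-timeout": election}

print(suggest_etcd_timings(1))    # same campus / nearby AZs: defaults are fine
print(suggest_etcd_timings(150))  # cross-continent: ~1.5 s before a dead leader is even noticed
# And every write still waits on a quorum of members over that WAN RTT, which
# is why vanilla control planes usually stay within one region.
```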
This can just be multiple datacenters located close together (~100 km) similar to AWS AZs.
Fun fact, on certain (major) cloud providers, in certain regions, AZs are sometimes different floors of the same building :)
I like how clear Azure is on this:
> Availability zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters with independent power, cooling, and networking. The physical separation of availability zones within a region limits the impact to applications and data from zone failures, such as power and cooling failures, large-scale flooding, major storms and superstorms, and other events that could disrupt site access, safe passage, extended utilities uptime, and the availability of resources.
https://learn.microsoft.com/en-us/azure/architecture/high-av...
I think they expanded Tokyo but previously that was a single-building "region"
And it's virtually impossible to make, say, a Singapore region resilient to natural disasters
Did they claim Tokyo was more than one availability zone? If the 'tokyo' region was only ever claimed to be '1 availability zone' I think being in a single building technically still satisfies my quote above.
But yes, agreed.
Yes, in the API you got multiple AZs.
You may be obligated not to name them, but I'm not: Google.
AZ is a term used by AWS and Azure. GCP documentation instead tells you to "Distribute your resources across multiple zones and regions", where regions are physically different data centers.
That actually wasn't the one I was thinking of!
The latency is exceptional.
We run a k8s control plane across datacenters in west, central, and east US and it works fine.
I assume your site-to-site latency's under 100ms? If so, that's fine.