We had an outage a few years ago on Black Friday. We had been getting our DC to purchase and rack servers for us for years, and the data centre had IIRC four separate circuits that our servers were on, depending on which rack they were in. Unfortunately, we hadn't told them which servers served which purpose, and we occasionally repurposed hardware from one service to another.

This resulted in one of our microservices existing entirely on one circuit, along with significant swaths of other services. We also overprovisioned our hardware so that we could cope with huge increases in traffic, e.g. Black Friday/Cyber Monday weekend. Generally a good idea, but since our DC obviously didn't have any visibility into our utilization, they didn't realize that, if our servers suddenly spiked to 100% CPU use, it could triple our power usage.
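To put rough, made-up numbers on that (none of these figures are from the actual incident, just a sketch of why "CPU pegged everywhere" and "breaker" end up in the same sentence):

    # Back-of-envelope sketch with illustrative numbers only -
    # not the real hardware or circuit specs from this story.
    idle_watts = 150            # per server, mostly idle
    peak_watts = 450            # per server, CPU pegged at 100%
    servers_on_circuit = 20     # everything that ended up on one circuit

    circuit_volts = 208
    circuit_amps = 30
    usable_fraction = 0.8       # breakers are typically only loaded to ~80%

    capacity_w = circuit_volts * circuit_amps * usable_fraction
    print(f"circuit capacity: {capacity_w:.0f} W")              # ~4992 W
    print(f"idle draw: {idle_watts * servers_on_circuit} W")    # 3000 W, comfortably under
    print(f"peak draw: {peak_watts * servers_on_circuit} W")    # 9000 W, roughly 3x idle and well over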

Easy to see where this is going, I'm sure.

The microservice which existed entirely on one circuit was one of the most important, and was hit constantly - we were a mobile game company, and this service kept track of players' inventories, etc. Not something you want to hit the database for, so we had layers of caching in Redis and memcached, all of which lived on the application servers themselves, all of them clustered so that we could withstand several of our servers going offline. This meant that when we got a massive influx of players all logging in to take advantage of those Black Friday deals, the hardest-hit service was probably this one and its associated Redis and memcached clusters, as well as (to a lesser extent) the primary database and its replication nodes - some of which were also on the same circuit.
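(For anyone who hasn't built one of these: the read path for that kind of service is usually some variation of the sketch below - check the cheap cache first, fall back to the next tier, and only hit the database on a miss, backfilling the caches on the way out. The key names, TTLs and the load_inventory_from_db helper here are invented for illustration; this isn't our actual code.)

    import json

    import redis
    from pymemcache.client.base import Client as MemcacheClient

    # Illustrative clients; in our setup the caches lived on the app servers themselves.
    memcache = MemcacheClient(("127.0.0.1", 11211))
    redis_client = redis.Redis(host="127.0.0.1", port=6379)

    def load_inventory_from_db(player_id):
        """Hypothetical stand-in for the expensive database query."""
        raise NotImplementedError

    def get_inventory(player_id):
        key = f"inventory:{player_id}"

        cached = memcache.get(key)           # tier 1: memcached, cheapest lookup
        if cached:
            return json.loads(cached)

        cached = redis_client.get(key)       # tier 2: Redis
        if cached:
            memcache.set(key, cached, expire=60)   # backfill the faster tier
            return json.loads(cached)

        # Tier 3: the database we really don't want every login hitting.
        inventory = load_inventory_from_db(player_id)
        payload = json.dumps(inventory)
        redis_client.set(key, payload, ex=300)
        memcache.set(key, payload, expire=60)
        return inventory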

So as we're all trying to tune the systems live to optimize for this large influx of traffic, it trips the breaker on that circuit and something like 1/3 of our servers go offline. We call the CEO of the DC company, he has to call around to figure out what the heck just happened, and it takes a while to work out what went on and why. Someone has to go into the DC to flip the breaker (once they know it's not just going to trip again), which is a several-hour drive from Vancouver to Seattle.

Meanwhile, we all have to frantically promote replica databases, re-deploy services to other application servers, and basically try to keep our entire system up and online for the largest amount of traffic we'd ever had, on 60% of the server capacity we'd planned on.
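"Promote replica databases" sounds fancier than it is - if you've never done it, here's roughly what it looks like with PostgreSQL streaming replication (just as an example; the hostname is made up and I'm not claiming this is the exact stack we ran):

    # Illustration only: assumes PostgreSQL 12+ streaming replication,
    # which may not match what we were actually running.
    import psycopg2

    conn = psycopg2.connect("host=standby-1 dbname=postgres user=postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        # The standby stops replaying WAL and starts accepting writes.
        cur.execute("SELECT pg_promote();")

The hard part isn't the promotion itself, it's repointing every service at the new primary while everything else is also on fire.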

I was working on that problem (not just awake, but specifically working on that issue) for 23 hours straight after the power went out. Our CEO made a list of every server we had and how we wanted to balance them across the circuits, and then the DC CEO and I spent all night powering off servers one by one, physically moving them to different cabs, bringing them back online, rinse and repeat.

TL;DR electricity is complicated.

Thanks for sharing, that was a very entertaining read and gave me fond memories of the decade or so I worked in a phone switch.

My office was in a space that was shared by a large legacy phone switch, SONET node, and part of our regional data center but I worked in infrastructure doing software development. My proximity[0] meant I ended up being used to support larger infrastructure efforts at times, but it usually just meant I got a lot of good ... stories.

I wonder if there's a collection of Data Center centric "Daily WTF" stories or something similar.

For me, I think my favorite is when we had multiple unexplained power failures very late at night in our test/management DC[1]. It turned out "the big red kill switch" button behind the plexiglass - the thing designed to make sure someone doesn't accidentally lean into it and shut everything off - was being mistaken for the master light switch by the late night cleaning crew. Nobody thought about the cleaning crew because none of the other DCs allowed a cleaning crew anywhere near them, but this was a test DC (someone forgot about the other little detail). If memory serves, it took a few outages before they figured it out. The facilities manager actually hung around one night trying to witness it, only to have the problem not happen (because the cleaning lady didn't turn the lights off when people were there, duh!). I'd like to say it was a "maybe bugs/animals/ghosts are doing it" impulse that caused them to check the cameras, but it was probably just someone noticing that the outages coincided with the nights the late night cleaning crew did their work.

Outside of that, there was the guy who made off with something like 4 of these legacy switch cards because some fool put a door stop on the door while moving some equipment in. He was probably really excited when he found out they were valued at "more than a car", but really disappointed when he put them on eBay for something like $20,000 (which was, I wanna say, at least a 50% markdown), was quickly noticed by the supplier[2], and got arrested, leaving us awaiting the return of our hardware.

[0] Among many other things, due to a diverse 17-year career there, but mostly just because I was cooperative/generally A-OK with doing things that "were far from my job" when I could help out my broader organization.

[1] When that went down, we couldn't connect to the management interfaces of any of the devices "in the network". It's bad.

[2] Alarms went off somewhere -- these guys know if you are using their crap, you're stuck with their crap and they really want you stuck paying them for their crap. I'm fairly certain the devices we used wouldn't even function outside of our network but I don't remember the specifics. AFAIK, there's no "pre-owned/liquidation-related market" except for stripping for parts/metals. When these things show up in unofficial channels, they're almost certainly hot.