> If your database goes down at 3 AM, you need to fix it.

Of all the places I've worked that had the attitude "If this goes down at 3 AM, we need to fix it immediately", there was only one where that was actually justifiable from a business perspective. I've worked at plenty of places that had this attitude despite the fact that overnight traffic was minimal and nothing bad actually happened if a few clients had to wait until business hours for a fix.

I wonder if some of the preference for big-name cloud infrastructure comes from the fact that during an outage, employees can just say "AWS (or whatever) is having an outage, there's nothing we can do" vs. being expected to actually fix it.

Seen that way, the ability to fix problems more quickly when self-hosting could be considered an antifeature for the employee getting woken up at 3 AM.

The worst SEV calls are the ones where you twiddle your thumbs waiting for a support rep to drop a crumb of information about the provider outage.

You wake up. It's not your fault. You're helpless to solve it.

Not when that provider is AWS and the outage is hitting news websites. You share the link to AWS being down and go back to sleep.

News is one thing; if the app or service being down impacts revenue, safety, or security, you won't be getting any sleep, AWS or not.

No. You sit on the call and wait to restore your service to your users. There’s bullshit toil in disabling scale-in as the outage gets longer.

Eventually, AWS has a VP of something dial in to your call to apologize. They’re unprepared and offer no new information. They get handed to a side call for executive bullshit.

AWS comes back. Your support rep only vaguely knows what’s going on. Your system serves some errors but digs out.

Then you go to sleep.

This is also the basis for most SaaS purchases by large corporations. The old "Nobody gets fired for choosing IBM."

Really? That might be an anecdote sampled from unusually small businesses, then. Between myself and most peers I’ve ever talked to about availability, I heard an overwhelming majority of folks describe systems that really did need to be up 24/7 with high availability, and thus needed fast 24/7 incident response.

That includes big and small businesses, SaaS and non-SaaS, high scale (5M+ rps) to tiny scale (100s to 10k rps), and all sorts of different markets and user bases. Even at the companies that were not staffed or providing a user service overnight, overnight outages were immediately noticed because on average, more than one external integration/backfill/migration job was running at any time. Sure, “overnight on call” at small places like that was more “reports are hardcoded to email Bob if they hit an exception, and integration customers either know Bob’s phone number or how to ask their operations contact to call Bob”, but those are still environments where off-hours uptime and fast resolution of incidents were expected.

Between me, my colleagues, and friends/peers whose stories I know, that’s an N of high dozens to low hundreds.

What am I missing?

> What am I missing?

IME the need for 24x7 for B2B apps is largely driven by global customer scope. If you have customers in North America and Asia, now you need 24x7 (and x365 because of little holiday overlap).

That being said, there are a number of B2B apps/industries where global scope is not a thing. For example, many providers who operate in the $4.9 trillion US healthcare market do not have any international users. Similarly the $1.5 trillion (revenue) US real estate market. There are states where one could operate where healthcare spending is over $100B annually. Banks. Securities markets. Lots of things do not have 24x7 business requirements.

I’ve worked for banks, multiple large and small US healthcare-related companies, and businesses that didn’t use their software when they were closed for the night.

All of those places needed their backend systems to be up 24/7. The banks ran reports and cleared funds with nightly batches: hundreds of jobs a night for even small banking networks. The healthcare companies needed to receive claims and process patient updates overnight so that their systems were up to date when they next opened for business (e.g. your provider’s EMR is updated if you die or have an emergency visit with another provider you authorized for records sharing, and no, this is not handled by SaaS EMRs in many cases). The “regular” businesses closed for the night generated reports and frequently had IT staff doing migrations, or senior staff working at midnight on something due the next day (when the head of marketing is burning the midnight oil on that presentation, you don’t want to be the person explaining that she can’t do it because the file server hosting the assets is down all the time after hours).

And again, that’s the norm I’ve heard described from nearly everyone in software/IT that I know: most businesses expect (and are willing to pay for or at least insist on) 24/7 uptime for their computer systems. That seems true across the board: for big/small/open/closed-off-hours/international/single-timezone businesses alike.

You are right that a lot of systems at a lot of places need 24x7. Obviously.

But there are also a not-insignificant number of important systems where nobody is on a pager, where there is no call rotation[1]. Computers are much more reliable than they were even 20 years ago. It is an Acceptable Business Choice to not have 24x7 monitoring for some subset of systems.

Until very recently[2], Citibank took their public website/user portal offline for hours a week.

1 - if a system does not have a fully staffed call rotation with escalations, it's not prepared for a real off-hours uptime challenge

2 - they may still do this, but I don't have a way to verify right now.

This lasts right up until an important customer can't access your services. Executives don't care about downtime until they have it, then they suddenly care a lot.

You can often have services available for VIPs, and be down for the public.

Unless there's a misconfiguration, usually apps are always visible internally to staff, so there's an existing methodology to follow to make them visible to VIPs.

But sometimes none of that is necessary. At a company with a $1B market cap, I've seen a failure case where the solution was manual execution by customer success reps while the computers were down. It was slower, but not many people complained that their reports took 10 minutes to arrive after being parsed by Eye Ball Mk 1s, instead of the 1 minute of wait time they were used to.

Thousands of orgs have full-stack OT/CI apps/services that must run 24/7/365 and are run fully on-premises.

Uptime is also a sales and marketing point, regardless of real-world usage. Business folks in service-providing companies will usually expect high availability by default, only tempered by the cost and reality of more nines.

Also, in addition to perception/reputation issues, B2B contracts typically include an SLA, and nobody wants to be in breach of contract.
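To put rough numbers on "more nines" and on what an SLA target actually permits, here is a minimal sketch of the arithmetic. The function names, the 30-day billing period, and the 4-hours-a-week maintenance window are all assumptions picked for illustration; none of them come from a real contract or from any provider mentioned in this thread.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 30.0) -> float:
    """Minutes of downtime permitted over the period at a given availability.

    `availability` is a fraction (0.999 means "three nines").
    The 30-day default period is an assumption; real SLAs define their own.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


def availability_from_downtime(downtime_hours: float, period_days: float = 7.0) -> float:
    """Availability fraction implied by a fixed amount of downtime per period."""
    total_hours = period_days * 24
    if not 0.0 <= downtime_hours <= total_hours:
        raise ValueError("downtime must fit inside the period")
    return 1.0 - downtime_hours / total_hours


if __name__ == "__main__":
    # Hypothetical SLA targets, for illustration only.
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} over 30 days allows {minutes:.1f} minutes down")

    # Hypothetical 4-hour weekly maintenance window.
    print(f"4 h/week offline is {availability_from_downtime(4.0):.2%} availability")
```

At 99.9% over 30 days the budget is about 43 minutes, which is why a single unattended overnight outage can blow through a three-nines SLA, while a system that is allowed to be offline for a few hours a week is not even at two nines.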

I think the parent you're replying to is wrong, because I've worked at small companies selling into large enterprise, and the expectation is basically 24/7 service availability, regardless of industry.