When I was an SRE at Google, our oncall was extremely serious (if the service went down, Google was unable to show ads, record ad impressions, or do any billing for ads). It was done on a rotation and lasted 1 week (IIRC it was 9AM-9PM; another time zone covered the other 12 hours). The oncall was empowered to do pretty much anything required to keep the service up and running, including cancelling scheduled downtimes, pausing deployment updates, stopping abusive jobs, stopping abusive developers, and invoking an SVP if there was a fight with another important group.

We sent a test page periodically to make sure the pager actually beeped. We got paid extra for being in the rotation. The leadership knew this was a critical step. Unfortunately, much of our tooling was terrible, which would cause false pages, or failed critical operations, all too frequently.

I later worked on SWE teams that didn't take dev oncall very seriously. At my current job, we have an oncall, but it's best-effort, business hours only.

>empowered to do pretty much anything required to keep the service up and running,

Is that really uncommon? I've been on call at many companies and many types of institutions and, as far as I can recall, I've never once been told I couldn't do something to bring a system up. It's kinda the job?

On-call seriousness should be directly proportional to pay. Google pays. If smallcorp wants to pay me COL, I'll be looking at that 2AM ticket at 9AM when I get to work.

That’s pretty good. Our oncall is actually 24-hour for one week. On paper it looks very serious, but even the best of us don’t really know everything, so issues tend to drag on until the morning. We don’t get any compensation for it either. Someone who has a bad night still needs to log on the next day. There is an informal understanding to relax a bit if the night is too bad, though.

I did 24hr-for-a-week oncall for 10+ years, do not recommend.

12-12 rotation in SRE is a lot more reasonable for humans

Unfortunately, 24hr-for-a-week seems to be the default everywhere nowadays; it's just not practical for serious businesses. It's just an indicator of how important uptime is to a company.

I agree. It sucks. And our schedule is actually 2 weeks in every five. One is secondary and the other is primary.

Handling my first non-prod alert bug as the oncall at Google was pretty eye opening :)

It was a good lesson in what a manicured lower environment can do for you.