On-call issues are staffing issues, not tooling issues.

Inadequate staffing is the only reason on-call exists. Sure, on a properly staffed night shift people might mostly sit around all night, getting paid and not being terribly busy.

But if a company needs someone at night, they need someone at night. Companies getting away with not paying for that is why on-call sucks.

In other words, on-call sucks because companies don’t pay to solve the problems that require it. There’s no self-correcting feedback.

A tool can’t fix that, and on-call is not inevitable. Good luck.

I assume it's part of the pay. You can't be a firefighter or a cop and then complain that there are night shifts. I've had nearly 4 years of it at a payment gateway and IIRC only one time was there something that had to be solved that night. When it happened, it was sort of my fault anyway; a good deal of the problems are (should be?) within the control of the people on call. And I think companies like payment gateways and cloud services, which need people active at all times, are also far more tolerant of things like spending a week reviewing a PR, so downtime is less frequent even if its impact is much higher.

Though I'd agree it's a staffing issue. 5 people in a cycle is fine. If you had a concert or something that week, just swap places with a colleague. When we reduced it to 2 people, it was not cool to spend half your time on-call.
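To put rough numbers on the 5-person vs 2-person difference, here's a minimal sketch of a weekly round-robin with a swap. The function names and the one-week shift length are assumptions for illustration, not how any particular scheduling tool works.

```python
def weekly_rotation(people: list[str], weeks: int) -> list[str]:
    """Assign one person per week, cycling through the team in order."""
    return [people[week % len(people)] for week in range(weeks)]


def swap_weeks(schedule: list[str], week_a: int, week_b: int) -> list[str]:
    """Return a copy of the schedule with two weeks traded between colleagues."""
    swapped = list(schedule)
    swapped[week_a], swapped[week_b] = swapped[week_b], swapped[week_a]
    return swapped


def on_call_share(schedule: list[str], person: str) -> float:
    """Fraction of the scheduled weeks that `person` spends on call."""
    return schedule.count(person) / len(schedule)


five = ["ann", "bob", "cat", "dan", "eve"]  # hypothetical team
two = ["ann", "bob"]

share_of_five = on_call_share(weekly_rotation(five, 10), "ann")  # one week in five
share_of_two = on_call_share(weekly_rotation(two, 10), "ann")    # every other week
```

With five people each person carries a fifth of the weeks, and a swap costs nobody anything. With two people it's half, and there's only one colleague to trade with.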

There are also policies like no releases on Fridays and no releases during a vacation week. If there's a tool for this, it would be one that flags these behaviors. Unfortunately, we can't really control when partners go down.
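A flagging tool for those two policies could be as small as the sketch below. Everything here is hypothetical: the function name, and the idea of recording vacation weeks as (ISO year, ISO week) pairs, are assumptions made for the example.

```python
from datetime import date

FRIDAY = 4  # date.weekday() counts from Monday = 0, so Friday is 4


def release_warnings(day: date, vacation_weeks: set[tuple[int, int]]) -> list[str]:
    """Return policy warnings for a release planned on `day`.

    `vacation_weeks` holds (ISO year, ISO week) pairs. How a team actually
    records vacation weeks is an assumption of this sketch.
    """
    warnings = []
    if day.weekday() == FRIDAY:
        warnings.append("release planned on a Friday")
    iso = day.isocalendar()
    if (iso[0], iso[1]) in vacation_weeks:
        warnings.append("release planned during a vacation week")
    return warnings


# Example: flag a Friday release, and a release in a week marked as vacation.
friday_release = release_warnings(date(2024, 5, 3), set())
vacation_release = release_warnings(date(2024, 5, 7), {(2024, 19)})
```

It only warns rather than blocks, since sometimes a Friday hotfix really is the lesser evil.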

I used to work the night shift handling most off-hours issues for a couple dozen teams. We would occasionally have to call someone, but not that often compared to the alternative. Most of the time it was just to get sign-off on what we already planned to do.

When I started people were paid for any hours they worked on-call. By the end, the company changed the policy so on-call was part of base pay. For those who were on-call during the changeover, their last year of on-call pay was averaged and added to their salary. Everyone who came after that got screwed (that includes me).

Once I changed to the day shift I got called a few times for on-call. Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority. If I ever get called two times for the same issue, that’s my fault. So far, it’s never happened.

> When I started people were paid for any hours they worked on-call

I've yet to hear of any alternative compensation model that actually works. Just pay people in their choice of money or time off in lieu. Sorry to hear you got screwed.

> Every single time, I documented what I did to fix it, as I did it, and handed it off to the ops team. Or in some cases I automated the fix. I have 0 tolerance for being called in my free time. I don’t care what the boss says my priorities are, if I’m being called at night, stopping that in its tracks is my #1 priority.

100% agree, I think people are far too tolerant of being paged. Especially management - the productivity impact of constant interrupts is huge. In a previous job one of my favourite things to do was go out to teams and just disable alerts they said were noisy or unactionable. If there was any pushback/consequence I was happy to accept responsibility (but never had to).

Disabling non-actionable alerts actually lowered the error rate in my experience, because people would start paying attention to the alerts. Even if they were being lazy, they'd be able to see a pattern after getting rid of the noise.

Exactly! Cut the noise, boost the signal. Every alert outside business hours should mean "drop everything and investigate this". Otherwise it can wait until the morning.
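As a rough sketch of that rule: the `Alert` fields, the route names, and the Mon-Fri 09:00-17:00 window are all assumptions for illustration, not any real alerting tool's API.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours (Mon-Fri, 09:00-17:00); adjust to your own team.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    name: str
    actionable: bool  # is there something a human can actually do about it?
    urgent: bool      # does it justify "drop everything and investigate"?


def in_business_hours(now: datetime) -> bool:
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Decide what happens to an alert: disable, page, notify, or wait."""
    if not alert.actionable:
        return "disable"  # noise: nobody can act on it, so nobody should see it
    if alert.urgent:
        return "page"     # worth interrupting someone at any hour
    if in_business_hours(now):
        return "notify"   # non-urgent, but people are at their desks anyway
    return "wait_until_morning"


tuesday_3am = datetime(2024, 5, 7, 3, 0)
decision = route(Alert("disk at 70%", actionable=True, urgent=False), tuesday_3am)
```

The hard part isn't the code, it's being honest about which alerts deserve `urgent=True`.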

I think we somewhat agree. Uncompensated on-call is not acceptable. Even if you're not busy, knowing you could be interrupted at any moment is an ever-present burden that takes a toll on your personal time.

But as long as the expected cost of downtime outweighs the financial cost of keeping someone available to fix it, on-call in some form will be inevitable. (There are a lot of instances where the cost doesn't make sense, and we should just accept the system being broken until 9am.)
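Here's that comparison as back-of-the-envelope arithmetic. The parameter names and all the figures are made up for illustration; plug in your own.

```python
def on_call_pays_off(
    incidents_per_year: float,
    hours_saved_per_incident: float,
    downtime_cost_per_hour: float,
    on_call_cost_per_year: float,
) -> bool:
    """True if the expected downtime cost avoided exceeds the cost of on-call.

    `hours_saved_per_incident` is how much sooner the system comes back
    compared to waiting for someone to start work in the morning.
    """
    expected_cost_avoided = (
        incidents_per_year * hours_saved_per_incident * downtime_cost_per_hour
    )
    return expected_cost_avoided > on_call_cost_per_year


# Hypothetical payment system: 4 night incidents a year, 8 hours saved each,
# 5000 per hour of downtime, versus 40000 a year to pay for on-call.
payments = on_call_pays_off(4, 8, 5000, 40000)      # 160000 avoided vs 40000

# Hypothetical internal tool: same incidents, but downtime costs 200 per hour.
internal_tool = on_call_pays_off(4, 8, 200, 40000)  # 6400 avoided vs 40000
```

For the second case the numbers say to let it stay broken until morning.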

I don't think on-call needs to suck though. IMO "staffing issues" (whether it's headcount, time, competing priorities, etc.) are resourcing issues, and I believe better tooling can absolutely help with that, either by reducing the resources required to fix the underlying issues or by making their cost quantifiable. Thanks for the good luck :)