If you’re having a correlated outage like that, then it’s likely you’ll fix the prod issue before the cloud engineers at some giant cloud company even respond to an internal escalation, much less fix the issue. More than likely your prod issue is causing the logging problem.

If you mean you are experiencing two totally unrelated issues at the same time, then I don’t think that’s a reasonable thing to really assign much value to as it’s incredibly unlikely.

Half of $30k/mo trivially pays for an engineer you hire to only manage such a cluster for you and just works an hour a week unless a pager goes off if you truly need that level of peace of mind. If you’re hiring for such a position I have a few rock star level folks who would love such a job.

The hypothetical problems people imagine for on-prem infrastructure get really strange to me. I could come up with the same sort of scenarios for cloud-based SaaS infrastructure just as easily.

> I don’t think that’s a reasonable thing to really assign much value to as it’s incredibly unlikely.

In my experience the systems/tools needed to debug production issues are often only used when they’re needed.

That means you now need health and uptime monitoring on your log server; without it, the server might break at random and no one would notice until you need it.
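To make that concrete, here is a minimal sketch of the kind of staleness check I mean, assuming the log server reports a heartbeat timestamp somewhere you can read it. The function name `log_server_is_stale` and the `max_silence_s` threshold are hypothetical, picked only for illustration; a real setup would more likely lean on whatever monitoring stack you already run.

```python
import time
from typing import Optional


def log_server_is_stale(
    last_heartbeat: Optional[float],
    now: float,
    max_silence_s: float = 300.0,  # assumed threshold, tune to taste
) -> bool:
    """Return True when the log server has been silent for too long.

    last_heartbeat: Unix timestamp of the last heartbeat seen, or None
                    if no heartbeat has ever been recorded.
    now:            current Unix timestamp, passed in so the check is
                    easy to test.
    """
    if last_heartbeat is None:
        # Never heard from it at all: treat as broken, not as healthy.
        return True
    return (now - last_heartbeat) > max_silence_s


if __name__ == "__main__":
    now = time.time()
    # A heartbeat 60 seconds ago is within the assumed 5 minute window.
    print(log_server_is_stale(now - 60, now))
```

The point is that the check alerts on silence rather than on an error message, so a log server that quietly died months ago still pages someone.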

> The hypothetical problems people imagine for on-prem infrastructure get really strange to me

It really comes down to the people and whether you have the expertise on the team, and whether the team can realistically manage the system long term. It’s typically safer to spend more money for the managed service.

(It’s a safer decision, not necessarily better)

> It really comes down to the people and whether you have the expertise on the team

Aren't these people supposed to debug and fix complex problems in prod? And if they can do that, why can't they run and debug a log server?

Of course there are trade-offs with any outsourcing decision. But I think we should have higher expectations of engineers.

I don’t think it’s necessarily safer or better for anything but your job security.

100% agree. If I am using a cloud log provider I wouldn't expect them to solve my logging issue(s) as fast as I need. More importantly, I have no real way to put more resources on that fix.

On top of that, with a third-party service I'd be very surprised if both went down at the same time and it wasn't an issue further upstream, like AWS. If it's my own logging service and it went down during a prod outage, I likely didn't properly isolate my logging service in the first place.

> Half of $30k/mo trivially pays for an engineer you hire to only manage such a cluster for you and just works an hour a week unless a pager goes off if you truly need that level of peace of mind. If you’re hiring for such a position I have a few rock star level folks who would love such a job.

1 person? Is that person always on call?

Yep, absolutely. I’ve come up with the term “man on the mountain” for such positions.

It’s when one person is exceedingly talented at exactly one thing - but isn’t exactly a typical employee who is good at or interested in doing much else other than keeping that one thing online and reliable.

Their job is to go live on their mountain for weeks or months at a time, doing nothing other than keeping their phone on and answering it within the first couple of rings regardless of when they’re called. If they are good at their job you likely don’t even need to call - they already know it’s broken before you do.

I’ve employed a few such folks over my career. They tend to be the “alternative” style candidate - exceptional people with exceptional flaws. They love the simple tradeoff.

That said, of course this is ignoring bus factor and oversimplifying things. Typically this is one deep subject matter expert who sits off on the side of a small team, so there is at least one “understudy” hanging around as well.

I still advocate for such positions when they make sense though. I would much rather in-house my own “insurance” vs overpay some giant company each month only to find out the insurance didn’t exist when I needed to make a claim. It’s certainly more risk to my career - but I have very strong feelings that as a manager or executive my job is NOT to cover my own ass because it’s easier.