Yes, my monitoring system not alerting me when the systems it runs on are failing is the entire problem.

That's not a general "software breaks when disks fail" situation: that's a monitoring system failing at its one job.

Your monitoring system failing silently when your infrastructure is under stress is precisely the failure mode that monitoring exists to prevent.

Zabbix solves this with native HA and self-checks. Prometheus makes it your problem to solve with external tooling, and most people don't, until they need it.

Why wouldn't your monitoring system alert you when metrics suddenly disappear? Sounds like you need a better monitoring system, prometheus is not gonna magically solve that problem for you. No wonder you were having issues with prometheus...

I'm not sure what you mean.

Of course the systems that have to alert me to failure have to be designed with mechanisms to alert me to the fact that they themselves are failing.

Zabbix, Nagios, Munin -- practically everything that existed before: understood this.

Prometheus doesn't because it optimised intentionally for being easy to deploy and for there being a hierarchy of prometheus's in a tree-like formation. Which makes sense, but forces a much more distributed and difficult to reason model.

Monitoring systems can't be designed for the happy path. By definition, they only matter when things are going wrong- which is precisely when the happy path isn't available. Prometheus is excellent when everything is fine (scaling aside). That's not when you need your monitoring system to be excellent.

I think we're running really different monitoring setups, I'd never expect my alerting solution to still be able to alert to me if it's down or degraded, nor would I expect my metrics gathering software to alert me if it's down, that's why I have monitoring setup for those things in the first place.

But, I'm sure your setup makes as much sense in your context as mine makes in my context. As long as it works for you, we're all happy :)

"I have monitoring set up for those things" - but that doesn't solve the ambiguity. When Prometheus misses a scrape, nothing fires. Silence looks identical whether your service is down, the network blipped, or Prometheus itself is struggling. A defensive monitoring system has to treat absence of data as a signal, not just absence of a problem.