There are ways to scale Prometheus (look at Thanos), but none of the solutions is really bug-free.
See this PR for example (https://github.com/prometheus/prometheus/pull/18364) - this used to impact a production deployment I worked on. Prometheus, Thanos and even OpenTelemetry are full of these kinds of problems - but at the same time it's the best we have and we should be grateful they're free and open source.
I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution - we've all seen what happens with DataDog & co.
Please don't take my words lightly: I worked with the rest of my team on a large-scale observability platform, and scalability should not be underestimated - at the same time, DataDog / Splunk prices are simply unjustified. It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).
> It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).
Can you show the math here? This is a very bold claim, and I’m super curious. A shared Google Sheet would work well.