At KubeCon Europe a very good chunk of booths were observability stacks. Everyone was claiming they're better than the competitors (with some of them justifying themselves just by saying "it's written in Rust").

Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g. a custom solution on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? This, to me, seems like yet another way to reinvent the wheel.

If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.

Why is this one a winning one compared to the overwhelming "competition"? It seems like we're reinventing the wheel for the 100th time instead of focusing on unifying our efforts to make the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or at least mitigated).
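To make the OTEL point concrete, here's a minimal sketch of what "instrument once, swap backends later" looks like in Python. It assumes the opentelemetry-sdk and OTLP exporter packages and a collector listening on the default gRPC port; the endpoint, meter and counter names are placeholders, and exact module paths can shift between SDK versions.

```python
# Sketch: instrument once with OTel, point OTLP at whatever backend you like.
# Assumes: pip install opentelemetry-sdk opentelemetry-exporter-otlp
# and a collector (or vendor endpoint) listening on localhost:4317.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("demo")
requests_total = meter.create_counter("requests_total")
requests_total.add(1, {"path": "/"})
# Swapping Prometheus for Mimir, ClickHouse, or a SaaS backend should then be
# a collector/exporter config change, not an application change.
```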

I was thinking this might be a result of the cheap-money (post-COVID) era ending and everyone scrambling to reduce their Datadog/cloud costs. Thinking back on 2023/2024, lots of companies were bleeding large amounts of capital to those vendors, and I imagine plenty of people saw an opportunity to create leaner and cheaper stacks.

This is my instinct too. I've had the pleasure of using DataDog and the pain of negotiating with their salespeople!

Yes. Their salespeople don't even negotiate - they just tell you this is the price and that's it. Dunno why they need salespeople if prices are non-negotiable.


I have tried to self-host Grafana (Loki, Prometheus and Alloy) as an o11y stack for prepbook.app. This is hard. I have a BSc in CS, not that it says much. I managed to do it eventually, after some research. It was not plug-and-play in any way. The docs even kept saying the setup was not production ready. I couldn't find the production guide, only the "forget about self-hosting and simply pay for us to host it" pitch. After I deployed it, the UX was so abrasive my partner won't even try to go into it to figure out a problem. That was a few months ago. Since then new solutions have arrived and I'm waiting to have the time to migrate. I saw PostHog has a solution, but I'd prefer something I could self-host and completely own.

I wondered how come no one is trying to solve this problem. It seems like it's just a matter of time.

With that being said, my experience may be very skewed, since prepbook is a passion project running on a VPS at essentially zero scale. All I care about is the UX of the stack, not scale. Just for context.

FWIW, I have no CS degree and barely attended school at all, and I found Grafana + Prometheus + Loki fairly easy to set up, at least compared to what we used before those tools were available. Maybe it's because I used NixOS for the setup, but besides learning some new domain-specific things I didn't know before, I don't recall hitting any particular bumps or roadblocks. I also went the 100% self-hosted route (spread across two hosts at home).

What exactly were you struggling with when it came to the setup? Just a ton of new concepts to learn which took time, or something specific to Grafana/Prometheus/Loki?

"Getting it running" is the easy part.

"Getting it ready for production" is a different game.

I've fallen on my sword many times trying to explain that Prometheus fails every metric of production readiness; in fact, Google themselves replaced Borgmon (the inspiration for Prometheus) with Monarch because "tiny unreliable time series databases everywhere" was, in fact, not the successful and reliable deployment strategy they had claimed.

But, it is very easy to set up. Just don't go looking for failure modes, because they're everywhere and every single one of them is catastrophic.

There are ways to scale Prometheus (look at Thanos), but none of the solutions is really bug-free.

See this PR for example (https://github.com/prometheus/prometheus/pull/18364) - it used to impact a production deployment I worked on. Prometheus, Thanos and even OpenTelemetry are full of these kinds of problems - but at the same time it's the best we have, and we should be grateful they're free and open source.

I'd still choose an open source stack (and contribute to it) rather than go for a proprietary solution - we've all seen what happens with DataDog & co.

Please don't take my words lightly; I worked with the rest of my team on a large-scale observability platform, and scalability should not be underestimated - at the same time, DataDog / Splunk prices are simply unjustified. It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).

> It's ironically cheaper to build a team of engineers that will maintain a sane observability stack instead of feeding the monster(s).

Can you show the math here? This is a very bold claim, and I’m super curious. A shared Google Sheet would work well.

Well, I am running the stack in production right now, but everyone has a different understanding of what that actually means...

Do you have concrete examples of these catastrophic failures? I personally haven't experienced any over the years, but I'm doing very boring and typical stuff, so it wouldn't surprise me if there were still hard edges.

There's a difficult distinction here, you're right.

Technically, even a single server running LAMP as root but taking frontend traffic meets the definition of "in production", but I think we all recognise that it's not the right idea.

What I'm referring to is: should the disk start to have issues, what does Prometheus do? If the scrapers start to stall due to connection timeouts, what does Prometheus do? If you are doing linear interpolation of data and you have massive gaps because you're polling opportunistically, what does Prometheus do?

I'm all about boring technology, but prometheus assumes too much happy path. It assumes that a single node is enough for time series data that is used for alerting.

Which, it is: at very small scale and with best effort reliability.

It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, or actually billing users, or to infer issues that need to be correlated across multiple systems.

> should the disk start to have issues

If that happens, is Prometheus really the biggest of your worries? Software breaks left and right when disks disappear from under it; I'm not sure this is either unexpected or unique to Prometheus.

> If the scrapers start to stall due to connection timeouts, what does Prometheus do?

I'm having this "issue" all the time, as some of my WiFi-connected (less important) cameras are just at the edge of WiFi range, and I'm using Prometheus to scrape metrics from them. It seems like a request times out, then the next time it doesn't, and everything just works? What's the issue you're experiencing with this exactly?

> It's not acceptable as soon as lost data could be critically important in diagnosing major issues in billing systems, or actually billing users, or

Wait, what? Billing systems? That stuff would go into your proper database, wouldn't it? Sure, if prometheus/node_exporter fails or whatever, you won't get metrics out of the host, but again, if those things start failing on that host, the host has bigger issues than "prometheus sucks at scale".

I was eagerly awaiting being educated about potential gaps in my understanding of Prometheus; instead it seems like you simply don't happen to like the way they do things. I was under the impression they did something wrong or something was broken, but these things just seem like the typical stuff you have to think about for any service you deploy.

Yes, my monitoring system not alerting me when the systems it runs on are failing is the entire problem.

That's not a general "software breaks when disks fail" situation: that's a monitoring system failing at its one job.

Your monitoring system failing silently when your infrastructure is under stress is precisely the failure mode that monitoring exists to prevent.

Zabbix solves this with native HA and self-checks. Prometheus makes it your problem to solve with external tooling, and most people don't, until they need it.
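To be concrete about what "external tooling" means here, the usual shape is a small check that runs somewhere Prometheus doesn't, asks Prometheus whether it's healthy and still seeing targets, and pages through an independent channel when it isn't. A rough sketch - the URL and the paging hook are placeholders, not anyone's actual setup:

```python
# Sketch of meta-monitoring: a cron-driven check that runs outside the
# Prometheus host and pages via an independent channel when Prometheus
# itself is down or no longer sees any targets. URL/hook are placeholders.
import requests

PROM = "http://prometheus.internal:9090"

def page(message: str) -> None:
    # Placeholder: send to PagerDuty/SMS/email - anything that does not
    # depend on Prometheus or Alertmanager being healthy.
    print("PAGE:", message)

def check() -> None:
    try:
        # /-/healthy is Prometheus's own health endpoint.
        requests.get(f"{PROM}/-/healthy", timeout=5).raise_for_status()
        # Make sure the query path works and it still sees scrape targets.
        r = requests.get(f"{PROM}/api/v1/query", params={"query": "up"}, timeout=5)
        r.raise_for_status()
        if not r.json()["data"]["result"]:
            page("Prometheus is up but sees no scrape targets")
    except requests.RequestException as exc:
        page(f"Prometheus unreachable or unhealthy: {exc}")

if __name__ == "__main__":
    check()
```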

Why wouldn't your monitoring system alert you when metrics suddenly disappear? Sounds like you need a better monitoring system, prometheus is not gonna magically solve that problem for you. No wonder you were having issues with prometheus...

I'm not sure what you mean.

Of course the systems that have to alert me to failure have to be designed with mechanisms to alert me to the fact that they themselves are failing.

Zabbix, Nagios, Munin -- practically everything that existed before -- understood this.

Prometheus doesn't, because it intentionally optimised for being easy to deploy and for there being a hierarchy of Prometheus instances in a tree-like formation. Which makes sense, but it forces a much more distributed and difficult-to-reason-about model.

Monitoring systems can't be designed for the happy path. By definition, they only matter when things are going wrong - which is precisely when the happy path isn't available. Prometheus is excellent when everything is fine (scaling aside). That's not when you need your monitoring system to be excellent.
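If it helps, one common mitigation (not Prometheus-specific, and only a sketch) is a dead man's switch: keep an always-firing heartbeat alert routed to a tiny external receiver, and have the receiver page when heartbeats stop arriving - silence then means the alerting pipeline itself is broken. The port, path and paging hook below are made up:

```python
# Sketch of a dead man's switch: Alertmanager (or anything else) POSTs a
# heartbeat here on a schedule; if heartbeats stop, we page through an
# independent channel. Port, path and paging hook are made up.
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

last_beat = time.time()

def page(message: str) -> None:
    print("PAGE:", message)  # placeholder: SMS, PagerDuty, etc.

class Heartbeat(BaseHTTPRequestHandler):
    def do_POST(self):
        global last_beat
        if self.path == "/heartbeat":
            last_beat = time.time()
            self.send_response(200)
        else:
            self.send_response(404)
        self.end_headers()

def watch(max_silence: float = 300.0) -> None:
    # Page if no heartbeat has arrived within max_silence seconds.
    while True:
        time.sleep(30)
        if time.time() - last_beat > max_silence:
            page("No heartbeat from the alerting pipeline for 5+ minutes")

threading.Thread(target=watch, daemon=True).start()
HTTPServer(("", 9099), Heartbeat).serve_forever()
```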

Do you think Prometheus + Grafana is the way to go?

Really depends on the use case. Home lab? Probably.

Production? As soon as you scale you need a proper solution. Prometheus (by itself) doesn't scale - you need Mimir or Thanos (or similar).

Clickhouse (the "clickstack") seems to be the new kid on the block. Looks very promising.

Is "observability stack" the new term for logs and stats?