I’m building an observability system that tries to surface answers instead of making people dig through huge amounts of raw telemetry.
The basic idea is that when one failure fans out across 20 services, you often end up with 20 alerts and 20 separate investigations, even though there is really just one root cause. I’m using distributed tracing to build a live model of how errors propagate through the system, and then exposing that context directly at each affected service.
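To make that concrete, here is a rough sketch of the collapsing step. It is not the actual implementation. It assumes each span carries a parent ID, a service name, and an error flag, and it treats a failing span with no failing children as the likely origin. `Span`, `root_cause_services`, and `affected_services` are hypothetical names made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span record; real tracing spans carry far more.
    span_id: str
    parent_id: Optional[str]
    service: str
    error: bool


def affected_services(spans):
    """Every service that reported at least one failing span."""
    return {s.service for s in spans if s.error}


def root_cause_services(spans):
    """Services owning a failing span that has no failing child span.

    Assumption: an error with no failing downstream call originated locally,
    while an error with a failing child was propagated from below.
    """
    children = defaultdict(list)
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)

    origins = set()
    for s in spans:
        if s.error and not any(c.error for c in children[s.span_id]):
            origins.add(s.service)
    return origins


# One request: gateway -> checkout -> payments (fails), plus two healthy calls.
trace = [
    Span("a", None, "gateway", True),
    Span("b", "a", "checkout", True),
    Span("c", "b", "payments", True),
    Span("d", "b", "inventory", False),
    Span("e", "a", "search", False),
]

print(sorted(affected_services(trace)))
print(sorted(root_cause_services(trace)))
```

In this toy trace three services report errors, but only `payments` has no failing call beneath it, so three alerts collapse into one candidate root cause. A real system would also need to handle things like timeouts, retries, async spans, and incomplete traces, which this sketch ignores.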
Longer term, I want this to become a very high-precision root cause analysis (RCA) engine. Right now I'm looking to try it with a few early design partners who already have a lot of tracing data, especially OpenTelemetry or Datadog APM users. I'd love to chat with anyone who would be willing to try it out!