Congratulations! The distinction between pure agentic exploration and deterministic steps is spot on. Runbooks give ops teams more confidence in the data exploration and save time/context.
Curious how much savings you observe from using runbooks versus purely letting Claude do the planning at first. Also, how can the runbooks self-heal if results from some steps in the middle aren't as expected?
>> how can the runbooks self-heal if results from some steps in the middle aren't as expected
Yeah, this is a very interesting angle. Our primary mechanism here today is agent-created auto-memories. The agent keeps track of the most useful steps and, more importantly, the dead-end steps as it executes runbooks. We think this offers a great bridge for suggesting runbook updates and keeping them current.
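To make the idea concrete, here's a minimal sketch of what that bookkeeping could look like. Everything here is hypothetical (class name, method names, the pruning threshold); it just illustrates tracking useful vs. dead-end steps and surfacing update suggestions:

```python
from dataclasses import dataclass, field

@dataclass
class RunbookMemory:
    """Hypothetical sketch: record step outcomes across runbook runs
    so the agent can later suggest runbook updates."""
    useful_steps: dict = field(default_factory=dict)    # step -> hit count
    dead_end_steps: dict = field(default_factory=dict)  # step -> miss count

    def record(self, step: str, was_useful: bool) -> None:
        bucket = self.useful_steps if was_useful else self.dead_end_steps
        bucket[step] = bucket.get(step, 0) + 1

    def suggested_removals(self, min_misses: int = 3) -> list:
        # Steps that repeatedly dead-end are candidates to prune or rewrite.
        return [s for s, n in self.dead_end_steps.items() if n >= min_misses]

mem = RunbookMemory()
mem.record("check logs for service frontend", was_useful=True)
for _ in range(3):
    mem.record("grep legacy cache metrics", was_useful=False)
print(mem.suggested_removals())  # ['grep legacy cache metrics']
```

A real system would also carry context (which alert, what the step returned), but the same shape applies: accumulate evidence per step, then propose edits to keep the runbook current.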
>> Curious how much savings you observe from using runbooks versus purely letting Claude do the planning at first.
It really depends on runbook quality, so I don't have a straightforward answer. Of course, it's faster and cheaper if you have well-defined steps in your runbooks. As an example, compare `check logs for service frontend, faceted by host_name` with `check logs`: the agent does much more exploration in the latter case.
We wrote about the LLM costs of investigating production alerts more generally here, in case helpful: https://relvy.ai/blog/llm-cost-of-ai-sre-investigating-produ...
Re: savings - it depends on the use case. For example, one of our users set up a small runbook to run a group-by-IP query for high-throughput alerts, since that was their most common first response to those alerts. That alone cuts out a couple of minutes of exploration per incident and removes the variability of the agent deciding what data to investigate and how to slice it.
In our experience, runbooks provide a consistent, fast, and reliable way of investigating incidents (or ruling out common causes). In their absence, the AI does its usual open-ended exploration.