But does it work? I’ve used LLMs for log analysis and they have been prone to hallucinating root causes: depending on the logs, the distance between cause and effect can be larger than the context window; usually multiple failures at once are needed for things to go badly wrong; and plenty of benign issues throw scary-sounding errors.

Post author here.

Yes, it works really well.

1) The latest models are radically better at this. We noticed a massive improvement in quality starting with Sonnet 4.5.

2) The context issue is real. We solve it by using sub-agents that read through the logs and return only the relevant bits to the parent agent’s context.
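To make that concrete, here is a minimal sketch of the sub-agent pattern (all names and the chunking scheme are my own illustration, not Mendral's actual implementation): each sub-agent sees one chunk of the logs and returns only the lines worth surfacing, so the parent agent's context stays small.

```python
def chunk_logs(lines, chunk_size=200):
    """Split a log into fixed-size chunks, one per sub-agent call."""
    for i in range(0, len(lines), chunk_size):
        yield lines[i:i + chunk_size]

def filter_relevant(log_lines, sub_agent, chunk_size=200):
    """Fan log chunks out to sub-agents; collect only the relevant excerpts.

    `sub_agent` stands in for an LLM call that takes a chunk of log
    lines and returns the subset worth showing the parent agent.
    """
    relevant = []
    for chunk in chunk_logs(log_lines, chunk_size):
        relevant.extend(sub_agent(chunk))
    return relevant

# Example with a trivial stand-in sub-agent (a real one would be an LLM):
logs = [f"INFO heartbeat {i}" for i in range(500)]
logs.insert(300, "ERROR connection reset by peer")
excerpt = filter_relevant(logs, lambda chunk: [l for l in chunk if "ERROR" in l])
```

Only the excerpt reaches the parent's context; the 500 heartbeat lines never do.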

So you’re not getting alerts at 2 am from hallucinations?

I would be very interested in reading about this kind of orchestration and filtering, more than the data acquisition, if you have the energy for another post :)

We started writing very recently: https://www.mendral.com/blog - there is another post we made yesterday about the overall architecture. And we have a long list of things we're planning to write about in more detail.

Taking good note of your comment :)

We've actually started to gather metrics this week to write that exact post :) Coming soon!


Mendral co-founder here. We built this infra to have our agent detect CI issues like flaky tests and fix them. Observing logs is useful for detecting anomalies, but we also use them to confirm a fix after the agent opens a PR (we have long coding sessions that verify a fix and re-run the CI if needed, all in the same agent loop).

So yes it works, we have customers in production.

It can. Like all the other tasks, it's not magic and you need to make the job of the agent easier by giving it good instructions, tools, and environments. It's exactly the same thing that makes the life of humans easier too.

This post is a case study that shows one way to do this for a specific task. We found the root cause of a long-standing problem with our dev boxes this week using AI. I fed Gemini Deep Research a few logs and our tech stack, and it came back with an explanation of the underlying interactions, debugging commands, and the most likely fix. It was spot on; GDR is one of the best debugging tools for problems where you don't have full understanding.

If you are curious, and perhaps as a PSA: the issue was that Docker and Tailscale were competing on iptables updates, and in rare circumstances (one dev, once every few weeks) Docker DNS would get borked. The fix is to tell NetworkManager to ignore Docker-managed interfaces so Tailscale stops trying to do things with them.
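For anyone hitting the same thing, a sketch of that kind of NetworkManager fix (file path and interface-name patterns are assumptions; check which interfaces Docker actually creates on your boxes):

```ini
# /etc/NetworkManager/conf.d/10-unmanaged-docker.conf (hypothetical path)
# Tell NetworkManager to leave Docker's bridge and veth interfaces alone.
[keyfile]
unmanaged-devices=interface-name:docker0;interface-name:veth*;interface-name:br-*
```

Then reload NetworkManager (e.g. `sudo systemctl reload NetworkManager`) for the change to take effect.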

> it's not magic and you need to make the job of the agent easier by giving it good instructions, tools, and environments.

This. We had much better success letting the agent pull context rather than trying to push what we thought was relevant.

Turns out it's exactly like a human: if you push the wrong context, it'll influence them to follow the wrong pattern.

I'd put it somewhere in the middle, but closer to the pull end.

- I force the AGENTS.md into the system prompt if the agent reads a directory, or a file within one, that contains such a file. This is anecdotally very good and saves on function calls and context growth in multiple ways. Sort them. I'm now doing this with planning and long-term task-tracking markdown files.

- Everything else is pull, ideally by search. I've yet to substantially leverage sub-agents for context gathering; savings elsewhere have pushed the need out.
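The AGENTS.md injection above could look something like this (helper names and layout are my own sketch, not the commenter's setup): walk from the directory the agent touched up to the repo root, collect every AGENTS.md along the way, and prepend them outermost-first to the system prompt so nested files can override general ones.

```python
from pathlib import Path

def agents_md_chain(directory, repo_root):
    """Collect AGENTS.md files from `directory` up to `repo_root`,
    returned outermost-first."""
    d, root = Path(directory).resolve(), Path(repo_root).resolve()
    found = []
    while True:
        candidate = d / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate)
        if d == root or d == d.parent:  # stop at repo root (or filesystem root)
            break
        d = d.parent
    return list(reversed(found))

def system_prompt_with_agents_md(base_prompt, directory, repo_root):
    """Prepend the collected AGENTS.md contents to the system prompt."""
    docs = [p.read_text() for p in agents_md_chain(directory, repo_root)]
    return "\n\n".join(docs + [base_prompt])
```

Doing this once at prompt-build time is what saves the function calls: the agent never has to discover and read the AGENTS.md files itself.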

btw, hi Al, I see you are working on a new company since our last collaboration, want to catch up sometime and talk shop?

Thanks - that’s the maddening thing with flakes: is it the thing under test or the thing doing the testing? Hermeticity is a lie we tell ourselves :)

Honestly, with recent models, these types of tasks are very much possible. Now it mostly depends on whether you are using the model correctly or not.