I think part of the issue is that the LLM does not have enough context. Whether a bug lives in the test or in the implementation comes down purely to the requirements, which are often not in the source code but stored somewhere else (ticket system, documentation platform).
Without giving the actual feature requirements to the LLM (or the developer), it is impossible to determine which one is wrong.
Which is why I also think it is sort of stupid to have the LLM generate tests by just giving it access to the implementation. At best that tests the implementation as it currently is, but tests should be based on the requirements.
Oh, absolutely, context matters a lot. But the thing is, they still fail even with solid context.
Before I let an agent touch code, I spell out the issue/feature and have it write two markdown files, strategy.md and progress.md (the latter with the execution order of changes), inside a feat_{id} directory. Once I’m happy with those, I wipe the context and start fresh: feed it the original feature definition + the docs, then tell it to implement by pulling in the right source code context. So by the time any code gets touched, there’s already ~80k tokens in play. And yet, the same confusion frequently happens.
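For what it’s worth, a rough sketch of that scaffolding looks something like this. The `run_agent` call is a hypothetical stand-in for whatever agent/CLI you actually drive (each call assumed to start with a fresh context); the file and directory names are the ones from above:

```python
from pathlib import Path

def run_agent(prompt: str) -> None:
    """Hypothetical stand-in for the coding agent/CLI; each call starts a fresh context."""
    raise NotImplementedError  # wire up your own agent here

def plan_feature(feat_id: str, feature_definition: str) -> Path:
    """Phase 1: planning only. The agent writes the two markdown docs; no source code yet."""
    feat_dir = Path(f"feat_{feat_id}")
    feat_dir.mkdir(exist_ok=True)
    run_agent(
        "Do not modify any source code. Based on the feature definition below, write:\n"
        f"- {feat_dir}/strategy.md: the overall approach and risks\n"
        f"- {feat_dir}/progress.md: the execution order of changes\n\n"
        f"Feature definition:\n{feature_definition}"
    )
    return feat_dir

def implement_feature(feat_dir: Path, feature_definition: str) -> None:
    """Phase 2: fresh context. Feed the original definition plus the reviewed docs."""
    docs = "\n\n".join(
        (feat_dir / name).read_text() for name in ("strategy.md", "progress.md")
    )
    run_agent(
        "Implement the feature below, pulling in the relevant source files yourself.\n\n"
        f"Feature definition:\n{feature_definition}\n\nPlan:\n{docs}"
    )
```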
Even if I flat out say “the issue is in the test/logic,” even if I point out _exactly_ what the issue is, it just apologizes and loops.
At that point I stop it, make it record the failure in the markdown doc, reset context, and let it reload the feature plus the previous agent’s failure. Occasionally that works, but usually once it’s in that state, I have to step in and do it myself.
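That recovery step is just a third phase on top of the sketch above: append the failure note to the markdown doc, then hand everything to a fresh-context agent. Roughly (same hypothetical `run_agent` stand-in as before):

```python
from pathlib import Path

def run_agent(prompt: str) -> None:
    """Same hypothetical fresh-context agent call as in the previous sketch."""
    raise NotImplementedError

def retry_after_failure(feat_dir: Path, feature_definition: str, failure_note: str) -> None:
    """Phase 3: record the failure in progress.md, reset context, reload feature + failure."""
    progress = feat_dir / "progress.md"
    progress.write_text(
        progress.read_text() + f"\n\n## Previous attempt failed\n{failure_note}\n"
    )
    run_agent(
        "A previous agent failed on this feature; read its failure note and avoid repeating it.\n\n"
        f"Feature definition:\n{feature_definition}\n\n"
        f"Plan and failure history:\n{progress.read_text()}"
    )
```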