+1 to the CI/isolation point. That is the part that makes these setups work for me too: make the failure cheap to reproduce, make stderr visible, make the agent rerun the same command after the patch. A lot of bad agent behavior is really just "it never got a clean signal".

The part that still bites me is across sessions. A tight loop fixes this run, but next week the agent can walk into the same rake again: same wrong import path, same misuse of an internal API, same CI-only dependency issue. After patching the same class of failure a few times, I started writing those down outside the chat context so the next run sees the failure pattern before it guesses.