What’s the failure mode you see with single-agent Claude Code on complex tasks? (looping, context drift, plan collapse, tool misuse?)
What’s the failure mode you see with single-agent Claude Code on complex tasks? (looping, context drift, plan collapse, tool misuse?)
All of the above. The most frustrating one with the Putnam example with Claude was generating solutions that obviously didn't compile. This feels like plan collapse- not verifying its own work. I'm sure that if you just had a dumb two-model setup, it would eventually get to compiling code after n runs, but that was just for this one failure mode.
You can use hooks to not allow it to stop without successful build