I've mainly used the feature in Codex, where I've been able to get it to work for 5 straight days on a massive port (with breaks whenever rate limits were hit, which surprisingly happened only three times).

I don't know how well it works in Claude Code, but I wouldn't be worried about Haiku getting it wrong, and I don't see a problem with it relying on the transcript. I always set these things up to maintain a checklist of subtasks in a file and check them off as it goes, and to always implement with a red/green testing methodology: it writes and commits failing tests, then writes the feature or fixes the bug, and commits again with passing tests and an updated checklist file.

So the model should always know from the transcript whether the current task is done, by whether it shows the tests passing, and it should always know whether there are more tasks left, because the checklist file gets updated before each commit.
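
As a rough sketch of that decision, assuming the checklist is a markdown-style task list (`- [ ]` / `- [x]`): the file format and the names `remaining_tasks` and `next_action` are made up for illustration, not anything the tools themselves run.

```python
import re

# Matches markdown task-list lines such as "- [ ] port parser" or "- [x] port lexer".
# The checklist format is an assumption; adjust to whatever the file really uses.
TASK_LINE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def remaining_tasks(checklist_text: str) -> list[str]:
    """Return the descriptions of unchecked items in the checklist."""
    remaining = []
    for line in checklist_text.splitlines():
        match = TASK_LINE.match(line)
        if match and match.group(1) == " ":
            remaining.append(match.group(2).strip())
    return remaining


def next_action(tests_passed: bool, checklist_text: str) -> str:
    """Decide what to do next from the two signals described above."""
    if not tests_passed:
        # Still in the "red" phase: the current task is not done yet.
        return "fix_current_task"
    if remaining_tasks(checklist_text):
        # "Green", but unchecked items remain in the checklist.
        return "start_next_task"
    return "done"


if __name__ == "__main__":
    checklist = "- [x] port lexer\n- [ ] port parser\n- [ ] port codegen\n"
    print(remaining_tasks(checklist))
    print(next_action(tests_passed=True, checklist_text=checklist))
```

In practice the "tests passed" signal would come from whatever the transcript shows for the last test run; here it is just passed in as a boolean.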