I have had some success with /goal for long tasks that can be set up so that the agent can do good work for an extended period of time.

A lot of tasks aren't amenable to that, and the ones that are still need a lot of care to set up correctly. A typical vibe-coded codebase won't be set up that way by default.

I've come to think of the activity of choosing the right technology, the right architecture, the right testing setup, the right context, and the right /goals as "programming the agent".

How much does /goal actually help? In auto mode, I've tried working both with and without /goal and haven't noticed a difference.

https://code.claude.com/docs/en/goal#how-evaluation-works

> /goal is a wrapper around a session-scoped prompt-based Stop hook. Each time Claude finishes a turn, the condition and the conversation so far are sent to your configured small fast model, which defaults to Haiku. The model returns a yes-or-no decision and a short reason. A “no” tells Claude to keep working and includes the reason as guidance for the next turn. A “yes” clears the goal and records an achieved entry in the transcript.

> The evaluator runs on whichever provider your session is configured for. It does not call tools, so it can only judge what Claude has already surfaced in the conversation.

Apparently, it uses Haiku (by default) after every turn to decide whether the goal has been achieved. However, it relies only on the transcript itself (including the reasoning of the main model), so it can't independently verify that the goal has been achieved. So, when the main model thinks the goal is or isn't done, how often does Haiku disagree in a productive way? That's not clear to me.
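To make the loop in the quoted docs concrete, here is a rough Python sketch of how I read it. This is not Claude Code's actual implementation: `run_with_goal`, `Verdict`, the toy agent, the keyword-matching evaluator, and the `max_turns` cap are all hypothetical stand-ins I made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    done: bool    # the evaluator's yes-or-no decision
    reason: str   # short reason; on a "no" it becomes guidance for the next turn


def run_with_goal(
    condition: str,
    agent_turn: Callable[[str], str],
    evaluate: Callable[[str, List[str]], Verdict],
    max_turns: int = 10,  # safety cap; an assumption of this sketch, not from the docs
) -> List[str]:
    """Run agent turns until the evaluator says the condition is met."""
    transcript: List[str] = []
    guidance = ""
    for _ in range(max_turns):
        # The main agent finishes a turn and adds its output to the conversation.
        transcript.append(agent_turn(guidance))
        # The evaluator sees only the condition and the conversation so far.
        verdict = evaluate(condition, transcript)
        if verdict.done:
            # A "yes" clears the goal and records an achieved entry.
            transcript.append(f"[goal achieved] {verdict.reason}")
            break
        # A "no" keeps the agent working, with the reason as guidance.
        guidance = verdict.reason
    return transcript


def transcript_only_evaluator(condition: str, transcript: List[str]) -> Verdict:
    """Stand-in for the small model: no tools, it only reads the latest turn."""
    if "tests passed" in transcript[-1]:
        return Verdict(True, "latest turn reports passing tests")
    return Verdict(False, "no passing test run shown yet; run the tests and report the result")


def make_toy_agent() -> Callable[[str], str]:
    """Hypothetical agent that needs two turns to surface a passing test run."""
    turns = iter(["wrote the feature", "ran the suite: 12 tests passed"])

    def agent_turn(guidance: str) -> str:
        return next(turns)

    return agent_turn


if __name__ == "__main__":
    for entry in run_with_goal("all tests pass", make_toy_agent(), transcript_only_evaluator):
        print(entry)
```

Note that the stand-in evaluator would accept any turn that merely claims the tests passed, which is the limitation in question: it can only judge what has been surfaced in the conversation.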

I've mainly used the feature in Codex, where I've been able to get it to work for 5 straight days on a massive port (with breaks when rate limits were hit, which surprisingly happened only three times).

I don't know how well it works in Claude Code, but I wouldn't worry about Haiku getting it wrong, and I don't see a problem with it relying on the transcript. I always set these things up so that the agent maintains a checklist of subtasks in a file and checks them off, and so that it always implements with a red/green testing methodology: it writes and commits failing tests, then writes the feature or fixes the bug, and commits with passing tests and an updated checklist file.

So the model should always know from the transcript whether the current task is done, because the transcript shows the tests passing, and it should always know whether there are more tasks left, because the checklist file is updated before each commit.
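As a rough illustration of that "done" signal, here is a minimal Python sketch. It assumes the checklist file uses Markdown-style checkboxes (`- [ ]` and `- [x]`) and that the test runner's exit code is available; both are assumptions for the example, and `parse_checklist`, `remaining_tasks`, and `goal_done` are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Assumed format: "- [ ] task" for open items, "- [x] task" for finished ones.
CHECKBOX = re.compile(r"^\s*[-*]\s*\[( |x|X)\]\s*(.+)$")


def parse_checklist(text: str) -> List[Tuple[bool, str]]:
    """Return (is_done, task) pairs for every checkbox line; other lines are ignored."""
    items = []
    for line in text.splitlines():
        match = CHECKBOX.match(line)
        if match:
            items.append((match.group(1).lower() == "x", match.group(2).strip()))
    return items


def remaining_tasks(text: str) -> List[str]:
    """Tasks that are still unchecked."""
    return [task for done, task in parse_checklist(text) if not done]


def goal_done(text: str, tests_exit_code: int) -> bool:
    """Done only if the checklist is non-empty, fully checked, and the tests passed."""
    items = parse_checklist(text)
    return bool(items) and not remaining_tasks(text) and tests_exit_code == 0


if __name__ == "__main__":
    checklist = "- [x] port the parser\n- [ ] port the lexer\n"
    print(remaining_tasks(checklist))
    print(goal_done(checklist, tests_exit_code=0))
```

Treating an empty checklist as "not done" is a deliberate choice in this sketch, so that a missing or unparsable file never reads as success.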