It's interesting to revisit Brooks' "surgical team" in light of AI. For example, I frequently have Claude act as a "toolsmith", creating bespoke project-specific tools on the fly, which are then documented in Skills that Claude can use going forward. What has changed is that a) One person (or rather, one person-AI hybrid) plays all the roles within the surgical team, and b) Internal frictions such as cost, development time, and communication overhead have all been dramatically slashed.
> frequently have Claude act as a "toolsmith", creating bespoke project-specific tools on the fly, which are then documented in Skills that Claude can use going forward.
I also do this.
E.g. after watching Claude burn tokens (and extra time) building and then deploying a docker image over and over, I asked it to just create a build.and.deploy.sh script. I also have a test.deploy.sh script that Claude can use to confirm everything worked.
Saves a ton of time/tokens AND has the added benefit of being usable by me or other humans when doing manual tests or debugging outages etc.
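Roughly, the pair looks like this - image name, registry, and the deploy/health-check targets are placeholders, and the kubectl rollout is just one example of a deploy step:

```bash
#!/usr/bin/env bash
# build.and.deploy.sh - build the image once, push it, roll it out.
set -euo pipefail

IMAGE="registry.example.com/myapp:$(git rev-parse --short HEAD)"

docker build -t "$IMAGE" .
docker push "$IMAGE"
kubectl set image deployment/myapp app="$IMAGE"
kubectl rollout status deployment/myapp --timeout=120s
```

And the check script is just a cheap smoke test:

```bash
#!/usr/bin/env bash
# test.deploy.sh - confirm the rollout actually worked.
set -euo pipefail

curl -fsS https://myapp.example.com/healthz
echo "deploy looks healthy"
```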
I do something similar, but tell the agent to write a recipe into a justfile. Then it can run `just` and get a self-documenting list of all the tooling for the project (just build, just test, etc.)
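The justfile ends up looking something like this (recipe bodies are placeholders; the nice part is that a bare `just` prints the annotated list):

```
# Show this list when running bare `just`
default:
    @just --list

# Build and push the docker image
build:
    ./build.and.deploy.sh

# Run the test suite
test:
    ./scripts/run-tests.sh

# Smoke-test the running deployment
check-deploy:
    ./test.deploy.sh
```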
How well does that work for you? It's annoyingly inconsistent for me. I give it instructions on how to fetch JIRA tickets with a script that renders everything relevant to a .md file, and half of the time it will still default to reading them via ACLI. I have instructions on how to do a full build with warnaserror before committing, but I still get pipeline errors regularly because it skips the noincremental part, etc.
I have a harness for Claude Code "hooks" (https://code.claude.com/docs/en/hooks); in my case the hooks execute a Go tool in a separate project that runs changes made by Claude through a validator with rules that can be defined in various ways (regex, semgrep, etc.). The rules can warn Claude or block changes outright.
When I find Claude is using tools or approaches that I have replaced with more specific ones, I ask it to add a hook that blocks this in the future and points to the instructions for what to do instead.
And of course I wrapped all that up in a Skill so it knows what approaches to take to add things to hooks.
It becomes fairly trivial to incrementally stop it making repeated mistakes like this.
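Stripped down to a toy stand-in for the Go validator (the script name and the acli rule below are invented, just to show the shape), a blocking hook is a command registered for the PreToolUse event in .claude/settings.json; per the hooks doc above it gets the pending tool call as JSON on stdin, and exiting with code 2 blocks the call and sends stderr back to Claude:

```bash
#!/usr/bin/env bash
# check-tool-use.sh - toy PreToolUse hook; the real setup shells out to the Go validator.

payload=$(cat)   # Claude Code pipes the pending tool call in as JSON
cmd=$(jq -r '.tool_input.command // empty' <<<"$payload")

# Example rule (hypothetical): steer the agent away from raw acli and towards
# the render-to-markdown script from upthread.
if grep -q 'acli' <<<"$cmd"; then
  echo "Don't read tickets with acli; run the jira-to-md script and read the file it produces." >&2
  exit 2   # exit 2 = block the tool call and show the message to Claude
fi

exit 0
```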
I've had that happen before too, and I just added a line to CLAUDE.md or AGENTS.md, something like this (adapted to your example - the script name is whatever yours is actually called):
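```
IMPORTANT: Never read JIRA tickets via acli. Run ./scripts/jira-ticket.sh <TICKET-ID>
instead and read the .md file it produces.
```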
Claude has gotten better about following CLAUDE.md over the last year (it was pretty laughably bad at it previously).

I have that both in the skill and in CLAUDE.md, but it's not reliable - and polluting CLAUDE.md with task-specific instructions kind of sucks.
I have the same issue too; 99% of the time it's for one of two reasons:
1) It tried the tool, but for some reason it behaved unexpectedly, and Claude is VERY good at working around problems - it won't just stop.
2) The context got too long, so those rules were "forgotten".
It only released just over a year ago…
You may want to try out pi-agent and create custom extensions instead.
Then codify this behavior into a process that gets run through automatically.
I.e. keep $repo/origin as a bare repo, then prompt it to create a shell script that creates the worktree, cds into it, runs the script you mentioned, and instantiates pi in it. Potentially define explicit phases for your workflow and show the current phase in the UI, plus quality gates for transitions - e.g. force the implement-to-finalize transition to only happen if all tests succeed. Potentially add multiple review phases here too, with different prompts. This progressively gets rid of more and more inconsistencies.
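Concretely, the bootstrap script ends up being something like this (paths and script names are placeholders, and the agent launch at the end is left as a comment since the exact pi invocation depends on your setup):

```bash
#!/usr/bin/env bash
# new-task.sh <branch> - sketch of the worktree-per-task bootstrap described above.
# Assumes the bare clone lives at "$REPO/origin".
set -euo pipefail

REPO="$HOME/work/myproject"
BRANCH="${1:?usage: new-task.sh <branch>}"
WORKTREE="$REPO/$BRANCH"

# Create a fresh worktree off the bare repo and move into it.
git -C "$REPO/origin" worktree add "$WORKTREE" -b "$BRANCH"
cd "$WORKTREE"

# Run whatever project setup you already have (fetch the ticket, build, etc.).
./scripts/prepare-task.sh "$BRANCH"

# Finally, instantiate the agent in the worktree.
# pi
```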
Still not a perfect solution, but on average I've had less and less to manually address with that workflow. Albeit at the cost of tokens (multiple review phases obviously ingest all the changes multiple times).
Pi-agent's extensibility is just a lot better than the other harnesses', but you could obviously also introduce a different orchestrator to do the same. For me, pi-agent was just the least amount of effort to get it going.
On a local model with opencode, I wrote a specific JavaScript way to run SQL queries (query.js) because bash and psql were error-prone. When I saw it make a mistake, I told it in a passive-aggressive tone something like: "please edit AGENTS.md to detail how to use the query.js tool to run a query and to never use psql". I did this two times until it stopped wanting to use psql.
It seems like if you write the docs yourself, you're not leveraging the chance that the model itself knows the guard-rail wording that best prevents it from reaching for its default tool choice.