The trick is, with the setup I mentioned, you change the rewards.
The concept is:
Red Team (Test Writers): write tests without seeing the implementation. They define what the code should do based on specs/requirements only, and are rewarded by test failures. A new test that passes immediately is suspicious: it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced," and the barrier prevents them from writing tests pre-adapted to pass.
Green Team (Implementers): write implementation to make tests pass without seeing the test code directly. They only see test results (pass/fail) and the spec, and are rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as a noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
Refactor Team: improve code quality without changing behavior. They can see the implementation but are constrained by the tests passing. Rewarded by nothing changing (pretty unusual in this regard): all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give this team the skills needed to use them.
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
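A minimal sketch of the barrier idea, with toy functions standing in for real subagent dispatch (the names `red_agent`, `green_agent`, and `coordinator` are hypothetical; a real setup would spawn actual agents and parse their output):

```python
# Toy model of the visibility barriers: Red sees only the spec, Green sees
# the spec plus which test names failed, never the test bodies.

SPEC = "add(a, b) returns the sum of a and b"

def red_agent(spec):
    # Red writes a test from the spec alone; returns (test_name, test_fn).
    def test_add(impl):
        return impl(2, 3) == 5
    return "test_add", test_add

def green_agent(spec, failing_names):
    # Green gets the spec and the failing test names as a noisy signal,
    # and must implement against spec intent, not against assertions.
    return lambda a, b: a + b

def coordinator():
    name, test = red_agent(SPEC)
    impl = lambda a, b: None           # empty implementation: test must fail first
    assert not test(impl), "a test that passes immediately is suspicious"
    impl = green_agent(SPEC, [name])   # Green only learns *which* test failed
    return test(impl)

print(coordinator())  # True once Green closes the spec gap
```

The point of the sketch is only the information flow: at no step does Green receive the body of `test_add`, so hard-coding against the assertion is impossible by construction.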
You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.
What kind of setup do you use? Can you share? How much does it cost?
rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow
(I built it)
Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.
Sorry about that. Let me push an update.
Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?
Recursive language models: https://github.com/doubleuuser/rlm-workflow
If you are not spending 5-10k dollars a month on interesting projects, you likely won't see interesting results.
Sounds a lot like paying for online ads: they don't work because you're not paying enough, when in reality bots, scrapers, and now agents are just running up all the clicks.
You pay more to try to get above that noise and hope you'll reach an actual human.
The new "fast mode" that burns tokens at 6 times the rate is just scary, because that's what everyone will soon say we all need to be using to get results.
It feels like everyone's gone mad.
Here I am mostly writing code by hand, with some AI assistant help. I have a Claude subscription but only use it occasionally, because it can take more time to review and fix the generated code than it would to hand-write it. Claude only saves me time on a minority of tasks where it's faster to prompt than hand-write.
And then I read about people spending hundreds or thousands of dollars a month on this stuff. Doesn't that turn your codebase into an unreadable mess?
I've been thinking about this recently, and it seems like the most enthusiastic boosters always suggest the difference in results is a skill issue. But I feel like there are 4 factors which multiply out to influence how much value someone gets:
- The quality of model output for _your particular domain / tech stack_. Models will always do better with languages and libraries they see a lot of than with esoteric or proprietary ones.
- The degree to which "works" = "good" in your scenario. For a one-off script, "works" is all that matters; for a long-lived core library, there are other considerations.
- The degree to which "works" can be easily (better yet, automatically) verified.
- Techniques, existing code cleanliness, documentation, etc.
Boosters tend to lay all different experiences at the feet of this last, yet I'd argue the others are equally significant.
On the other hand, if you want to get the best results you can given the first 3 (which are generally out of one's control) then don't presume there's nothing you can do to improve the 4th.
Why read code when you are getting results fast ? See https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
I am not kidding. People don't seem to understand what's actually happening in our industry. See https://www.linkedin.com/posts/johubbard_github-eleutherailm...
Why is everyone obsessed with Mac Minis? They're awesome, but for the work that these people are attempting to do? Just seems... nonsensical. Renting a server is cheaper and still just as "local" as any of this (they want "self hosted"; I don't think anyone cares about local. Like, are people air gapping networks? lol)
And a senior director of Nvidia? He had several Mac Minis? I really gotta imagine a Spark is better... at least it'll be a bit smarter of a cat (I'm pretty suspicious he used a LLM to help write that post)
No time to think, gotta go fast?
It seems like the monkey-ladders story. Someone probably just had one sitting around and it worked or needed to do something Apple-specific and that message got lost along the way
They want access to Apple Messages. That's all there is to it AFAICT.
I'm not getting results. That's the point. Claude doesn't fucking work without human intervention. When left to its own devices it makes bad decisions. It writes bad code. It needs constant supervision to stop it from going off the rails and replacing working code with broken code. It doesn't know what it's doing!
It's about as far as you can get from being able to work independently.
Yegge is an entertainer. Gas Town is performance art, it's not meant to be taken seriously.
How much are you spending? See the initial post of the thread. My team has no problems with it; they are each spending 5-10k per month.
These are like, jokes right?
I think the output of companies that can invest in tokens vs. those who cannot will lead to crazy different outcomes in the next few years.
I can't really tell if this is sarcasm or not.
That's how half of these "agents" posts feel to me in general.
We have a very uncomplicated setup with claude code. A CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session.
It works wonderfully well. Costs about $200USD per developer per month as of now.
Paste the comment you replied to into an LLM good at planning. That’s something the codex/claude setups can create for you with a little back and forth.
Check out Matt Pocock’s work; he’s written excellent material about red-green-refactor and has a GitHub repo for his skills. Read and take what you need from his TDD skill and incorporate it into your own TDD skill tailored to your project.
This is just AI slop. If you follow what the actual designers of Claude/GPT tell you, it flies in the face of building out over-engineered harnesses for agents.
I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code.
You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly as how it's meant to be used. They pioneered skills.
Yep, not saying we don't need skills. Just harnesses.
Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it.
I do do TDD but using skills in this way is an anti-pattern for a multitude of reasons.
I don't think just saying it's an anti-pattern for a multitude of reasons, and then not naming any, is going to convince anyone it's an anti-pattern.
This is in fact precisely what skills is meant for and is the opposite of an anti-pattern, but more like best practice now. It's explicitly using the skills framework precisely how it was meant to be used.
This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?
And do you have any prompts to share?
You don't need most of this. Prompts are also normally what you would say to another engineer.
* There is a lot of duplication between A & B. Refactor this.
* Look at ticket X and give me a root cause
* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds
CLAUDE.md has stuff like: "Here's how you run the frontend. Here's how you run the backend. This module supports the frontend. That module is batch jobs. Always start commit messages with the ticket number. Always run compile at the top level. When you make code changes, always add tests," etc. etc.
They never do.
Sign up for your Claude Max (TM) subscription and have Claude set you up
Not Claude/Copilot. Claude.
This seems quite amazing really, thanks for sharing
What is the scope of projects / features you’ve seen this be successful at?
Do you have a step before where an agent verifies that your new feature spec is not contradictory, ambiguous etc. Maybe as reviewed with regards to all the current feature sets?
Do you make this a cycle per step - by breaking down the feature to small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring?
Why not refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier, is it worth encoding this into the harness?
I do it per feature, not per step. Write the AC for the whole feature upfront, then the agent builds against it. I haven't added a spec-validation step before coding but that's a good idea. Catching ambiguity in the spec before the agent runs with it would save a lot of rework
I'm curious how this works if the green team writes an implementation that makes a network call like an RPC.
Red team might not anticipate this if the spec doesn't detail every expected RPC (which seems unreasonable: this could vary based on implementation). But a unit test would need mocks.
Is green team allowed to suggest mocks to add to the test? (Even if they can't read the tests themselves?) This also seems gameable, though (e.g. mock the entire implementation). Unless another agent makes a judgment call on the reasonableness of the mock (though that starts to feel like code review more generally).
Maybe record/replay tests could work? But there are drawbacks in the added complexity.
I think the solution here is: Don't mock and inject dependencies explicitly, as function parameters / monads / algebraic effects. Make side effects part of the spec/interface.
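A small sketch of that idea (function and parameter names are made up for illustration): the RPC is passed in as a parameter, so its contract can live in the spec, and Red can test without a mocking framework by injecting a fake.

```python
# Make the side effect part of the interface: the RPC is an explicit
# dependency, so the spec can state its contract and tests can inject fakes.
from typing import Callable

def get_username(user_id: int, fetch: Callable[[str], dict]) -> str:
    # `fetch` is the injected network call; no hidden RPC inside.
    return fetch(f"/users/{user_id}")["name"]

# Red team's test supplies a fake without knowing the implementation:
def fake_fetch(path: str) -> dict:
    assert path == "/users/42"   # the expected call is part of the contract
    return {"name": "ada"}

print(get_username(42, fake_fetch))  # ada
```

Because the dependency is a parameter rather than an internal detail, "does it make the right RPC" becomes an assertion on the injected function instead of a judgment call about a mock's reasonableness.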
Someone directly down from you suggested looking up Matt Pocock's TDD skill, so I did:
https://github.com/mattpocock/skills/blob/main/tdd%2FSKILL.m...
Everything below quoted from that skill, and serves as a much better rebuttal than I had started writing:
DO NOT write all tests first, then all implementation. This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."
This produces crap tests:
- Tests written in bulk test imagined behavior, not actual behavior
- You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior
- Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
- You outrun your headlights, committing to test structure before understanding the implementation
Correct approach:
Vertical slices via tracer bullets.
One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
>One test → one implementation → repeat.
>Because you just wrote the code, you know exactly what behavior matters and how to verify it.
what you go on to describe is
One implementation → one test → repeat.
Seems like red team is incentivized to write tests that violate the spec since you're rewarding failed tests.
This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.
It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code.
There's a real disconnect. I was talking to a junior developer and they were telling me how Claude is so much smarter than them and they feel inferior.
I couldn't relate. From my perspective as a senior, Claude is dumb as bricks. Though useful nonetheless.
I believe that if you're substantially below Claude's level then you just trust whatever it says. The only variables you control are how much money you spend, how much markdown you can produce, and how you arrange your agents.
But I don't understand how the juniors on HN have so much money to throw at this technology.
So I take that feeling and use it to drive me to become a wizard like them. I've generally found that wizards are very happy to take on apprentices.
I'm not trying to call Claude a wizard (I have similar feelings to you), but more that I don't understand that junior's take. We all feel dumb. All the time. Even the wizards! But it's that feeling that drives you to better yourself, and it's what turns you into a wizard.
Honestly, so much of what I hear from the "AI does all my coding" crowd just sounds very junior. It's just like how a year or two ago they were saying "it does the repetitive stuff." Isn't that what functions, libraries, functors, templates, and other abstractions are for? It feels like we're back to that laughable productivity metric of lines of code or number of commits. I don't know why we love our cargo cults. It seems people are putting so much effort into their cargo cults that they could have invented a real airplane by now.
It's 20 dollars a month to use...
Yes for the basic plan. However there are people who claim to use the API and spend hundreds, or thousands, of dollars a month.
It just seems totally crazy to me, I don't understand how wrestling with this slot machine is even mentally easier
Yes with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned".
Well they probably have the same ability to evaluate the correctness of a feature as a middle manager with a Harvard business degree
How do you make sure Red Team doesn't just write subtly broken tests?
How do you define visibility rules? Is that possible for subagents?
AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago - just to get the point across )
To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.
I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.
> To be clear, I don't do this.
How do you know that it works then? Are you using a different tool that does support it?
So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents?
Fun to see you not on tildes.
Setting up a clean room is one of the only ways to do evals on agentic harnesses. Especially relevant with Windsurf, which doesn’t have an easy CLI start.
So how? The easiest answer, when allowed, is docker. Literally a new image per prompt. There are also flags with Claude to not use memory, and from there you can use -p to have it behave like a normal CLI tool. Windsurf requires the manual effort of starting it up in a new dir.
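The per-prompt container idea can be sketched roughly like this (the image name `my-agent-image` and mount path are placeholders; `claude -p` is the non-interactive print mode mentioned above):

```shell
#!/bin/sh
# Hypothetical clean-room runner: a fresh --rm container per prompt means
# no memory or state survives between runs.
run_prompt() {
  docker run --rm \
    -v "$PWD":/work -w /work \
    my-agent-image \
    claude -p "$1"
}
```

Each call starts from the same image, so every prompt sees an identical environment, which is the property that makes harness-vs-harness comparisons meaningful.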
Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals?
Well, I mean, yes. I think people ought to be aware of how the harnesses compare for their stacks. But the clean-room approach applies to this RGR situation too.
you are replying to a bot, that's why.
What