This is anecdotal but just a couple days ago, with some colleagues, we conducted a little experiment to gather that evidence.
We used a hierarchy of agents to analyze a requirement, letting agents with different personas (architect, business analyst, security expert, developer, infra, etc.) discuss a request and distill a solution. They all had access to the source code of the project to work on.
Then we provided the very same input, including the personas' definitions, straight to Claude Code, and we compared the results.
The council of agents got to a very good result, consuming about $12, mostly using Opus 4.6.
To our surprise, a single prompt straight into Claude Code got to a similarly good result, faster, consuming about $0.30 and mostly using Haiku.
This surely deserves more investigation, but our assumption / hypothesis so far is that coordination and communication between agents has a remarkable cost.
Should this be the case, I personally would not be surprised:
- The reason we humans separate jobs is that we have inherently limited capacity. We cannot become experts in all the needed fields: we just can't acquire the knowledge required to be good architects, good business analysts, and good security experts all at once. Apparently, that's not a problem for an LLM. So job separation is probably not the necessary pattern for LLMs that it is for humans.
- Job separation has an inherently high cost and just does not scale. Notably, most of the problems in human organizations are coordination problems, and the larger the organization, the higher the cost of its processes, to the point where processes turn into bureaucracy. In IT companies, many problems arise at the interfaces between groups, because of the low-bandwidth communication and inherent ambiguity of language. I'm not surprised that a single LLM can communicate with itself far better and more cheaply than a council of agents, which inevitably faces the same communication challenges as a society of people.
If it could be done with 30 cents of Haiku calls, maybe it wasn't a complicated enough project to provide good signal?
In that case a $12 program is probably too big to meaningfully review. Probably better to have smaller chunks you can review instead of generating one really large program in one shot.
Fair point. I could try a harder problem. That still doesn't explain why Claude Code felt the need to use Opus, or why Opus felt the need to burn $12 on such an easy task. I mean, it's 40 times the cost.
I'm a bit confused actually, you said you used Claude Code for both examples? Was that a typo, or was it (1) Claude Code instructed to use a hierarchy of agents and (2) Claude Code allowed to do whatever it wants?
I think the benefit may be task separation and cleaning the context between tasks. Asking a single session to do all three has a couple of downsides.
1. The context for each task gets longer, which we know degrades performance.
2. In that longer context, implicit decisions made in earlier thinking steps linger, so the model is probably more likely to go through with bad decisions that were made 20 steps back.
The way Stavros does it is Architect -> Dev -> Review. By splitting the task into three sessions, we get a fresh, shorter context for each one. At a minimum, skipping the thinking messages and intermediate tool output should increase the chances of a better result.
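A minimal sketch of that hand-off, assuming the point is context hygiene: each stage starts a fresh session seeded only with the previous stage's final artifact, never the full transcript. `call_model` is a hypothetical stand-in for a real LLM client call; the persona strings are illustrative.

```python
# Architect -> Dev -> Review, where each stage gets a fresh context
# seeded only with the previous stage's output artifact, not its
# thinking steps or tool output.

def call_model(system: str, user: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"[{system}] response to: {user[:40]}"

def run_stage(persona: str, artifact: str) -> str:
    # A fresh session: only the persona and the incoming artifact.
    return call_model(system=persona, user=artifact)

def pipeline(requirement: str) -> str:
    plan = run_stage("Architect: produce a design plan", requirement)
    code = run_stage("Dev: implement the plan", plan)
    review = run_stage("Review: critique the implementation", code)
    return review
```

The key property is that `pipeline` threads only the returned artifacts between stages, so each session's context stays short regardless of how verbose the previous session was.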
Using different agent personas and models at least introduces variability at token generation. Whether that's good or bad, I don't know; as far as I can tell, it's generally supposed to help.
Having the sessions communicate is, I think, a mistake, because you lose all the benefits of cleaning up the context. Given the chattiness of LLMs, you'll probably fill the context with multiple thinking rounds over the same message: one from the session that writes it and one from the session that reads it. You'll also probably get competing tool uses, with each session making its own calls to read the same content. It will probably be a huge mess.
The way I do it is I have a large session that I interact with and task with planning and agent spawning. I don't have dedicated personas or agents. The benefits, as I see them: I keep a single session with extensive context about what we are doing, plus a dedicated task handler with a much more focused context.
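A sketch of that shape, under the assumption that the orchestrator keeps the full history while each spawned handler sees only a short task brief. `Orchestrator` and `handle_task` are hypothetical names, and `handle_task` stands in for a fresh LLM session.

```python
# A long-lived planning session that spawns task handlers with a
# focused context: the orchestrator accumulates history, but each
# handler receives only the distilled brief for its one task.

class Orchestrator:
    def __init__(self):
        self.history = []  # full, ever-growing planning context

    def note(self, event: str):
        self.history.append(event)

    def spawn_task(self, brief: str) -> str:
        # The handler never sees self.history -- only the brief.
        self.note(f"spawned: {brief}")
        return handle_task(brief)

def handle_task(brief: str) -> str:
    # Stand-in for a fresh LLM session working on a single task.
    return f"done: {brief}"
```

The one-way flow is the point: context moves from orchestrator to handler as a brief, and nothing but the handler's result comes back.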
What I have seen with my setup is impressively good performance at the beginning that degrades as feedback and tweaks pile up over the course of the work.
Framing LLM use for dev tasks as "narrative" is powerful.
If you want specific, empirical, targeted advice or work from an LLM, you have to frame the conversation correctly. "You are a tenured Computer Science professor agent being consulted on a data structure problem" goes a very long way.
Similarly, context window length and prior progress exert significant pressure on how an LLM frames its work. At some point (often around 200k-400k tokens in), models seem to reach a "we're in the conclusion of this narrative" point and will sometimes do crazy stuff to reach whatever real or perceived goal there is.
Probably the same reason it takes a team of developers and managers 6 months to write what one or two developers can do on their own in one week. The overhead caused by constant meetings and negotiations is massive.
Just the other day I was asked to prepare slides for a presentation about something everyone already knows (among many other useless side-work)... I feel like with "AI" in general we are applying bandages, when my real problem is the big machine that gives me paper cuts all day.
Even with humans I've found that full ownership of a project, from architecture to implementation to deployment and operation, produces the best results.
Less context switching and communication overhead. Focus on well-thought-out, documented APIs to divide work across developers and support communication and collaboration.
LLMs also don't get the primary advantage humans get from job separation: diverse perspectives. A council of Opuses are all exploring the exact same weights with the exact same hardware, unlike multiple humans with unique brains and memories. Even with different models, Codex 5.3 is far more similar to Opus than any two humans are to each other. Telling an Opus agent to focus on security puts it in a different part of the weights, but it's the same graph; it's not really more of an expert than a general Opus agent with a rule to maintain secure practices.
You can differentiate by context, one sees the work session, the other sees just the code. Same model, but different perspectives. Or by model, there are at least 7 decent models between the top 3 providers.
I know, but none of those is nearly as much of a difference as another human looking at code. The top models have such overlapping training data they sometimes identify as each other.
An ensemble can spot more bugs / fixes than a single model. I run claude, codex and gemini in parallel for reviews.
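A minimal sketch of that parallel-review setup: fan the same code out to several reviewer models and take the union of their findings. `review_with` is a hypothetical stand-in (a real version would call each provider's API), and the hard-coded findings exist only so the sketch runs; the model names are illustrative.

```python
# Run several reviewer models in parallel and union their findings.
from concurrent.futures import ThreadPoolExecutor

def review_with(model: str, code: str) -> set:
    # Placeholder: a real implementation would call each model's API
    # with the code under review. Canned findings keep this runnable.
    findings = {
        "claude": {"off-by-one in loop"},
        "codex": {"off-by-one in loop", "unchecked error"},
        "gemini": {"missing input validation"},
    }
    return findings.get(model, set())

def ensemble_review(code: str, models: list) -> set:
    # Fan out to all reviewers concurrently, then merge results.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda m: review_with(m, code), models)
    return set().union(*results)
```

Taking the union maximizes recall: any bug flagged by at least one reviewer survives the merge, which matches the "spot more bugs than a single model" rationale.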
Agentic pipelines and systems fall into the same issues as humans who work together, mostly communication.
It's not like they can dump their full context to the "manager" agent, they need to condense stuff, which will result in misinterpreted information or missing information on decisions down the line.
IMO this was more relevant when agents had limited context windows
This I believe is true. I've been working on an agentic architecture, and whenever there was a new requirement, the simple workflow was to create a specific agent for it. Earlier, context windows were small and this was the default solution. Over time our agents have grown so numerous that they are a headache to maintain and debug.
Absolutely works with frontier models. What do you think about smaller models in these pipelines? That's literally what I'm working on, with qwen3.5-27b, and I'm splitting the task into 4 steps and not sure if that's the way to go. Do you have any experience to share?