I think the benefit may be task separation and cleaning the context between tasks. Asking a single session to do all three has a couple of downsides.
1. The context for each task gets longer, which we know degrades performance.
2. In that longer context, implicit decisions get made in the thinking steps, and the model is probably more likely to follow through on a bad decision made 20 steps back.
The way Stavros does it is Architect -> Dev -> Review. By splitting the task into three sessions, we get a fresh, shorter context for each one. At minimum, dropping the thinking messages and intermediary tool output should increase the chances of a better result.
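A minimal sketch of that pipeline (names and the `call_llm` helper are my own stand-ins, not any particular framework): each stage runs in a fresh session seeded only with the previous stage's final artifact, so the thinking steps and tool output never carry over.

```python
def call_llm(system: str, user: str) -> str:
    """Stand-in for one fresh session with your actual client API.

    Here it just echoes so the plumbing is runnable; only the final
    message is returned, never the intermediate reasoning.
    """
    return f"[{system}] {user[:40]}"

def pipeline(task: str) -> str:
    # Each call below starts a brand-new context.
    plan = call_llm("You are a software architect.", task)
    code = call_llm("You are a developer. Implement the plan.", plan)
    review = call_llm(
        "You are a code reviewer.",
        f"Plan:\n{plan}\n\nCode:\n{code}",
    )
    return review
```

The point is purely the data flow: the Dev session sees the plan text, not the Architect's transcript.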
Using different agent personas and models at least introduces variability at token-generation time. Whether that's good or bad, I don't know; as far as I know, it's generally supposed to help.
Having the sessions communicate with each other is, I think, a mistake, because you lose all the benefits of cleaning up the context. Given the chattiness of LLMs, you are probably going to fill up the context with multiple thinking rounds over the same message: one from the session that writes it and one from the session that reads it. You are also likely to get competing tool use, with each session making its own tool calls to read the same content. It will probably be a huge mess.
The way I do it is to have one large session that I interact with, tasked with planning and agent spawning. I don't have dedicated personas or agents. The benefit, as I see it, is that I keep a single session with extensive context about what we are doing, plus a dedicated task handler with a much more focused context for each subtask.
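Sketched out, the setup above looks something like this (the `FakeSession` class and the hard-coded subtask list are placeholders of mine, not a real API): one long-lived planner accumulates the broad context, and each subtask gets a fresh, narrow session that is thrown away afterwards.

```python
class FakeSession:
    """Stand-in for a real LLM session; keeps its own transcript."""
    def __init__(self, system: str):
        self.transcript = [("system", system)]

    def send(self, msg: str) -> str:
        self.transcript.append(("user", msg))
        reply = f"done: {msg[:30]}"  # a real session would call the model
        self.transcript.append(("assistant", reply))
        return reply

def run_task(task: str) -> str:
    # Long-lived session: sees everything, does the planning.
    planner = FakeSession("Plan the work and delegate subtasks.")
    planner.send(task)
    results = []
    # In practice the planner would emit these; hard-coded for the sketch.
    for subtask in ["design", "implement", "test"]:
        # Fresh session per subtask: focused context, discarded after use.
        worker = FakeSession(f"Do exactly one thing: {subtask}")
        results.append(worker.send(f"{subtask} for: {task}"))
    # Only the results, not the workers' transcripts, flow back up.
    return planner.send("Summarize: " + "; ".join(results))
```

Note that the workers' transcripts never enter the planner's context; only their final results do.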
What I have seen with my setup is impressively good performance at the beginning that degrades as feedback and tweaks to the work pile up.
Framing LLM use for dev tasks as "narrative" is powerful.
If you want specific, empirical, targeted advice or work from an LLM, you have to frame the conversation correctly. "You are a tenured Computer Science professor agent being consulted on a data structure problem" goes a very long way.
Similarly, context window length and prior progress exert significant pressure on how an LLM frames its work. At some point (often around 200k-400k tokens in), it seems to reach a "we're in the conclusion of this narrative" stage and will sometimes do crazy stuff to reach whatever real or perceived goal there is.