The fleet approach can work well particularly because: 1) different models are trained differently, even though using mostly same data (think someone who studied SWE at MIT, vs one who studied at Harvard), 2) different agents can be given different prompts, which specializes their focus (think coder vs reviewer), and 3) the context window content influences the result (think someone who's seen the history of implementation attempts, vs one seeing a problem for the first time). Put those traits in various combinations and the results will be very different from a single agent.