Hmm, this strategy only makes sense if you can trivially evaluate each agent's results, which I haven't found to be the case.

I expect a common case would be: one agent wrote code that does the thing I want. Another agent wrote code that isn't unmaintainable garbage. These are not the same agent. So now you have to combine the two solutions, which is quite a lot of work.