On the Twill web app, you can run the same task across different agents and multiple attempts, each in its own sandbox, and then pick the best result. This is super handy for UI work, where you can open the live preview for each attempt and compare them. The next step for us is adding a final pass where an agent evaluates the results and combines the best parts into one PR.
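To make the fan-out idea concrete, here is a rough sketch of the pattern in Python. It is not Twill's actual API: `run_attempt`, `fan_out`, `pick_best`, the `Attempt` fields, and the sandbox ID format are all hypothetical names made up for illustration, and the scoring function just stands in for the human picking a winner.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    agent: str
    attempt: int
    sandbox_id: str  # hypothetical: one isolated sandbox per attempt
    result: str


def run_attempt(task: str, agent: str, attempt: int) -> Attempt:
    """Hypothetical stand-in for launching one agent in its own sandbox."""
    sandbox_id = f"{agent}-{attempt}"
    return Attempt(agent, attempt, sandbox_id, f"[{sandbox_id}] result for: {task}")


def fan_out(task: str, agents: list[str], attempts_per_agent: int) -> list[Attempt]:
    """Run the same task once per (agent, attempt) pair, in parallel."""
    jobs = list(itertools.product(agents, range(1, attempts_per_agent + 1)))
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(run_attempt, task, a, n) for a, n in jobs]
        # Collect in submission order so results line up with `jobs`.
        return [f.result() for f in futures]


def pick_best(attempts, score):
    """Selection step; `score` stands in for the person comparing previews."""
    return max(attempts, key=score)


if __name__ == "__main__":
    attempts = fan_out("restyle the navbar", ["agent_a", "agent_b"], 3)
    for a in attempts:
        print(a.sandbox_id, "->", a.result)
    best = pick_best(attempts, score=lambda a: a.attempt)  # placeholder scoring
    print("picked:", best.sandbox_id)
```

The planned final pass would presumably replace `pick_best` with a step that reads all attempts and merges the best parts, rather than selecting a single one.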