> Run the same agent n times to increase success rate.

Are there benchmarks out there that back this claim?

Yes, this is the pass@k metric from code-generation research. The relevant paper is Evaluating Large Language Models Trained on Code (Chen et al., 2021), which introduced the metric.
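For reference, the paper gives a closed-form unbiased estimator: if you generate n samples and c of them pass, then pass@k = 1 - C(n-c, k) / C(n, k), the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated
    c: number of samples that passed
    k: evaluation budget (k <= n)
    """
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 samples, 1 passing, budget of 1 -> 0.5
print(pass_at_k(2, 1, 1))
```

The paper recommends this combinatorial form over naively running k samples and averaging, since it uses all n samples and has lower variance.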

Interesting, and how does Twill use it in that feature?

On the Twill web app, you can run the same task across different agents and multiple attempts (each in its own sandbox), then pick the best result. This is especially handy for UI work, where you can open the live preview for each attempt and compare. The next step for us is adding a final pass where an agent evaluates the results and combines the best parts into one PR.
