> Run the same agent n times to increase success rate.

Are there benchmarks out there that back this claim?

Yes, this is the pass@k metric from code-generation research. The relevant paper is Evaluating Large Language Models Trained on Code (Chen et al., 2021), which introduced the metric.
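For reference, the paper gives a closed-form unbiased estimator: if you generate n samples and c of them pass, then pass@k = 1 - C(n-c, k) / C(n, k), the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated
    c: number of samples that passed
    k: evaluation budget (k <= n)
    """
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 2 samples, 1 passing, budget of 1 -> 0.5
print(pass_at_k(2, 1, 1))
```

The paper recommends this combinatorial form over naively running k samples and averaging, since it uses all n samples and has lower variance.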

Interesting, and how does Twill use it in that feature?

On the Twill web app, you can run the same task across different agents and multiple attempts (each in its own sandbox), then pick the best result. This is especially handy for UI work, where you can open the live preview for each attempt and compare. The next step for us is adding a final pass where an agent evaluates the results and combines the best parts into one PR.
