Hacker News

ElFitz 5 hours ago

That’s what evals are for.

And there’s no reason evals can’t be run on multi-turn agents, whether or not they run in a loop: that’s pretty much what all these benchmarks do.
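
To make that concrete, here is a rough sketch of what a multi-turn eval could look like. Everything in it (`Task`, `run_episode`, `pass_rate`, the toy `counting_agent`, the canned feedback string) is made up for illustration and not taken from any particular benchmark or framework; a real harness would call a model and a real environment where this one uses stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: an agent maps the transcript so far to its next message.
Agent = Callable[[List[str]], str]


@dataclass
class Task:
    prompt: str
    check: Callable[[List[str]], bool]  # grades the whole transcript
    max_turns: int = 5                  # turn budget for the loop


def run_episode(agent: Agent, task: Task) -> bool:
    """Run the agent in a loop until the check passes or the turn budget runs out."""
    transcript = [task.prompt]
    for _ in range(task.max_turns):
        transcript.append(agent(transcript))
        if task.check(transcript):
            return True
        # Stand-in for environment or tool feedback (an assumption of this sketch).
        transcript.append("not done yet, try again")
    return False


def pass_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks the agent solves within its turn budget."""
    results = [run_episode(agent, task) for task in tasks]
    return sum(results) / len(results)


def counting_agent(transcript: List[str]) -> str:
    """Toy agent: replies with one more than the number of numeric replies so far."""
    numeric_replies = sum(1 for message in transcript if message.isdigit())
    return str(numeric_replies + 1)


tasks = [
    Task("count to 3", check=lambda t: t[-1] == "3"),  # reachable in 5 turns
    Task("count to 9", check=lambda t: t[-1] == "9"),  # not reachable in 5 turns
]
score = pass_rate(counting_agent, tasks)
```

The single-turn case is just `max_turns=1`, so the same grading code covers both the "in a loop" and "not in a loop" setups.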