Why not just use an eval harness to prove this catches more real bugs? Benchmarks on actual bug classes would be far more convincing than comparing against /review.
That's a great idea. I had trouble finding anything like this: a benchmark built for (AI) code reviewers.
I had expected to find something like an eval harness on GitHub, but couldn't find one.
Any suggestions? Or maybe we/I/someone should build something like this?
I suppose one challenge is that if it's publicly available, it would also be easy to game; still, it seems it would be useful if people agreed it's a good benchmark and could easily re-test tools themselves.
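For concreteness, here's a rough sketch of what I imagine such a harness could look like, in Python: a directory of cases with one seeded bug each, and a script that checks whether the reviewer's output mentions the buggy file. The case layout and the "adamsreview" command/flag are placeholders, not a real interface.

    #!/usr/bin/env python3
    # Toy harness: how often does a reviewer flag a seeded bug?
    # Assumed layout (placeholder): cases/<id>/{patch.diff,truth.json},
    # where truth.json names the buggy file; "adamsreview" stands in for
    # whatever reviewer CLI is being tested.
    import json
    import pathlib
    import subprocess

    CASES_DIR = pathlib.Path("cases")
    REVIEWER_CMD = ["adamsreview", "--stdin"]  # placeholder command and flag

    def run_reviewer(diff_text: str) -> str:
        # Feed the diff to the reviewer on stdin and return its raw review text.
        proc = subprocess.run(REVIEWER_CMD, input=diff_text,
                              capture_output=True, text=True, check=False)
        return proc.stdout

    def flagged(review: str, truth: dict) -> bool:
        # Crude check: did the review at least mention the buggy file?
        return truth["file"] in review

    def main() -> None:
        hits = total = 0
        for case in sorted(CASES_DIR.iterdir()):
            diff = (case / "patch.diff").read_text()
            truth = json.loads((case / "truth.json").read_text())
            hits += flagged(run_reviewer(diff), truth)
            total += 1
        print(f"flagged {hits} of {total} seeded bugs")

    if __name__ == "__main__":
        main()

Scoring by "mentions the buggy file" is deliberately crude; a real benchmark would need something like line-level matching and a way to penalize false positives.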
https://www.codereviewbench.com/
https://codereview.withmartian.com/
Based on how the same models' rankings fluctuate week to week, all I can conclude is that either no frontier model is statistically better than the others, or it's so task-dependent that the results can't converge.
That’s probably more work than the entire repo itself. It would need to be something like SWE-bench run with and without “adamsreview”.
You’re right, though; evals are actually fairly tricky to write and maintain.