That’s probably more work than the entire repo itself. Would need to be something like SWE-bench with and without “adamsreview”.
You’re right, though. It’s just that evals are fairly tricky to write and maintain.