That’s probably more work than the entire repo itself. It would need to be something like a SWE-bench run with and without “adamsreview”.

You’re right, but evals are actually fairly tricky to write and maintain.