I was thinking about building a GitHub repo made for evaluating Code Reviews. Something like a complex app (or perhaps a few branches with different options), and then PRs on each branch with varying types and degrees of bugs for a Code Review to find.
I suppose this would not be a 'real' benchmark because it would be public and so you couldn't necessarily trust scores people share about how their own tool did, but it would at least allow anyone to try out code review tools on their own and report relative effectiveness and characteristics.
I'll post again if I end up finding or building something like that. I couldn't find anything when I looked previously.
I'll also keep in mind your question as I continue testing this, because you are right that it would be useful to be able to describe what is different, not just the magnitude of bugs found.
https://codereview.withmartian.com/
https://www.greptile.com/benchmarks