> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...

What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better, because it has access to more knowledge about vulns in these OSS projects, and possibly even knows about your own "prior research"?

One would. But then the results are even weirder, as Opus 4.6 outscored Opus 4.8 by a huge margin.