Actually it's average because it's 5th on the leaderboard. GPT 5.5 and Opus 4.8 outperformed it. At 5-8x the cost, you would expect better! https://www.endorlabs.com/research/ai-code-security-benchmar...
As TFA says
> Two findings may help explain these average results.
> Timeouts
> Highest observed cheating
That's why it's 5th on the leaderboard - they score it as a fail for every timeout, and for every time it gives the correct answer because it already knows it.
That's insane