Hacker News

Actually, it's average: it's 5th on the leaderboard. GPT 5.5 and Opus 4.8 outperformed it. At 5-8x the cost, you would expect better! https://www.endorlabs.com/research/ai-code-security-benchmar...

As TFA says

> Two findings may help explain these average results.
> Timeouts
> Highest observed cheating

That's why it's 5th on the leaderboard: they give it a fail for every timeout, and for every time it gives the correct answer because it already knows it.

That's insane