Hacker News

Don't appreciate the slander, but I'll respond anyhow.

Contrary to your assumption, we're actually quite peeved that we might be seeing results from 5.6 instead of 5.5, as it's muddying our own internal data.

We've run the tasks on this benchmark hundreds of times on our own internal harness. It got magically better yesterday. Last week we were seeing worse performance (sub-80%).

I agree that benchmarks don't mean much for real world use, and I'm a bit disappointed at the lack of variety in the published benchmarks so far.

With that said, 88.8% is higher than Mythos, and the highest I've seen from vanilla Codex. If 5.6 is any better than 5.5, you'd think they would avoid publishing just one coding-related benchmark with a score that equals their previous model's.

> I'm not sure why a higher scores on a few tests [..]

It's not just higher scores: the API is no longer flagging tests with cybersecurity warnings that it had been flagging for weeks.