If you used GPT-5.5 over the last 24 hours or so, you may have already had access to 5.6.

I've been running tests on a harness we're building, and yesterday I saw a sudden jump of a few points. I reran the benchmark on vanilla Codex and got an ~88% score on Terminal Bench 2.1 from GPT-5.5.

The biggest indicator, beyond the score, was that 3 tests which frequently hit "safety" blockers with 5.5 started succeeding last night without warning.

These things can shift simply because of infrastructure changes; it doesn't have to be some mysterious A/B test.

I don't disagree; we've seen performance shift with capacity changes in the past.

With that said, I doubt OpenAI would choose to publish a single coding benchmark for a new model with a score that exactly matches their previous model's (88.8%).

[flagged]

Don't appreciate the slander, but I'll respond anyhow.

Contrary to what you're assuming, we're actually quite peeved that we might be seeing results from 5.6 instead of 5.5, since it's muddying our own internal data.

We've run the tasks on this benchmark hundreds of times for our own internal harness. It got magically better yesterday. Last week we were seeing worse performance (sub-80%).
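For anyone who wants to sanity-check a jump like this against run-to-run noise, here is a rough sketch using a two-proportion z-test. The function name and the pass counts are hypothetical placeholders, not our actual data, and the test assumes independent attempts, which repeated runs of the same tasks are not, so treat the result as a rough signal only.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for the difference between two pass rates.

    Uses the pooled-variance two-proportion z-test with a normal approximation.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples are all-pass or all-fail: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts for illustration only: task attempts passed last
    # week versus yesterday, out of the same number of attempts.
    z, p = two_proportion_z(pass_a=390, n_a=500, pass_b=440, n_b=500)
    print(f"z = {z:.2f}, two-sided p = {p:.4g}")
```

A small p-value only says the pass rate moved more than sampling noise would explain. It can't tell you whether the cause is a model swap, a capacity change, or something else on the serving side.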

I agree that benchmarks don't mean much for real world use, and I'm a bit disappointed at the lack of variety in the published benchmarks so far.

With that said, 88.8% is higher than Mythos, and the highest I've seen from vanilla Codex. If 5.6 is any better than 5.5, you'd think they would avoid publishing just one coding-related benchmark with a score that equals their previous model.

> I'm not sure why a higher scores on a few tests [..]

It's not just higher scores: the API is no longer flagging tests with cybersecurity warnings that it had been flagging for weeks.

Don't bother replying to the trolls around here.