Hacker News

More information on OpenAI's result (which seems better than DeepMind's) from the X thread:

> our OpenAI reasoning system got a perfect score of 12/12

> For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.

> We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.

I'm assuming that "GPT-5" here is a version with the same model weights but higher compute limits than even GPT-5 Pro, with many instances working in parallel, and some specific scaffolding and prompts. Still, extremely impressive to outperform the best human team. The stat I'd really like to see is how much money it would cost to get this result using their API (with a realistic cost for the "experimental reasoning model").

bazmattaz 3 days ago

Ha, so true. I was so tempted to copy and paste a problem into GPT-5 and see what it would say.

HardCodedBias 3 days ago

They likely had a prompt that gave considerable guidance.

Hopefully that prompt was the same for all questions (I think that is what they did for the IMO submission, or maybe it was Google that did that, not sure).

qwertox 3 days ago

> it succeeded on the 9th submission

What's the judgement here? Was it within the allotted time, or just a "try as often as you need to"?

modeless 3 days ago

It was within the allotted time. If I'm reading the scoreboard correctly [edit: I wasn't], the human teams typically submitted dozens or hundreds of attempts at each problem.

kevinwang 3 days ago

For problems that human teams eventually get correct, they seem to have submitted mostly 1 time -- occasionally 2 or 3. For problems that they did not get correct, there are some problems with up to 16 submissions.

Ah, I see I was in fact reading it wrong. So 9 is definitely an unusual but not unprecedented number of submissions.

jojomodding 2 days ago

The way the rules work is that you can submit as often as you want. The team with the most solved problems wins. The time it took to solve all the problems is the tiebreaker.

But submitting a non-working solution gives you a time penalty (usually 20 minutes). That penalty only applies if, in the end, you actually solve the problem. So it never hurts to try.
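
To make the scoring concrete, here is a rough sketch of the rule as described above. It assumes a flat 20-minute penalty per rejected submission and that a solved problem contributes the minutes elapsed from contest start to its accepted submission. The names `ProblemResult`, `team_score` and `rank_key` are made up for illustration and are not taken from any official judging software.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

PENALTY_MINUTES = 20  # assumption: "usually 20" per the comment, fixed here


@dataclass
class ProblemResult:
    wrong_attempts: int              # rejected submissions on this problem
    solved_at: Optional[int] = None  # minutes from start when accepted; None if never solved


def team_score(results: List[ProblemResult]) -> Tuple[int, int]:
    """Return (problems_solved, total_time_in_minutes) for one team."""
    solved = 0
    total_time = 0
    for r in results:
        if r.solved_at is None:
            # Unsolved problem: its wrong attempts cost nothing.
            continue
        solved += 1
        total_time += r.solved_at + PENALTY_MINUTES * r.wrong_attempts
    return solved, total_time


def rank_key(score: Tuple[int, int]) -> Tuple[int, int]:
    """Sort key: more solved problems first, then lower total time."""
    solved, total_time = score
    return (-solved, total_time)
```

With invented numbers: a team that solves one problem at minute 30 on the first try, solves a second at minute 200 after 8 rejected attempts, and fails a third after 16 attempts scores 2 solved with 30 + 200 + 8 × 20 = 390 minutes. The 16 failed attempts on the unsolved problem add nothing, which is why extra submissions never hurt on a problem you would otherwise leave unsolved.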