The summarized chain of thought for this task (linked in the blogpost) is 125 pages. That's an insane scale of reasoning, quite akin to what Anthropic has been teasing with Mythos.
That's here for anyone wondering - https://cdn.openai.com/pdf/1625eff6-5ac1-40d8-b1db-5d5cf925d...
Note that, though summarized, this is ~100k tokens. Anyone who routinely works with Codex (or any agentic harness, really) can tell you how trivial it is to eat up 100k tokens doing complex work. I've personally had plenty of codex 5.5 xhigh sessions where the pure chain-of-thought token count in a single turn exceeds 200k (and I assume it doesn't go further only because of compaction meta-guidance; the harness will push the model to stay under 256k per turn/thinking block).
I think the more interesting question is how many tokens were spent all told; the most interesting graph in the article imo is the success rate by log test-time compute: how many tokens are being spent on the right of the graph to hit a winning CoT/solution like this >50% of the time?
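For a rough feel of that "all told" number, here's a back-of-envelope sketch. It assumes each attempt is an independent sample with the same success probability, which is almost certainly not how the real runs were structured. The function names and the 10% per-attempt rate are made up for illustration; only the ~100k tokens figure comes from this thread.

```python
import math


def attempts_for_success(p_single: float, target: float = 0.5) -> int:
    """Independent attempts needed so that at least one succeeds with
    probability >= target, given per-attempt success probability p_single.

    Assumption: attempts are independent and identically distributed.
    """
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # P(at least one success in n tries) = 1 - (1 - p)^n >= target
    # => n >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


def total_tokens(p_single: float, tokens_per_attempt: int, target: float = 0.5) -> int:
    """Total tokens spent across all attempts under the same assumption."""
    return attempts_for_success(p_single, target) * tokens_per_attempt


# Hypothetical numbers: 10% per-attempt success, ~100k tokens per attempt.
print(attempts_for_success(0.10))
print(total_tokens(0.10, 100_000))
```

Under those made-up numbers the total is several times the length of the one winning trace, and it grows quickly as the per-attempt rate drops, which is why the x-axis on that graph matters more than the 125 pages.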
Today I generated the equivalent of two LOTR books just to fix three missing rows in my SQL models (and open a PR), so +1
or put differently, you melted x cubic meters of polar ice