Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...
The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved":
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.
I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.
edit: they just removed the reference to "3.1" from the pdf
I think this is 3.1 (3.0 Pro with the RL improvements of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.
The Deep Think moniker is for parallel compute models though, not long CoT like the Pro models.
It's possible though that Deep Think 3 is running 3.1 models under the hood.
That's odd considering 3.0 is still labeled a "preview" release.
I think it'll be 3.1 by the time it's labelled GA. They said after the 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.
The rumor was that 3.1 was today's drop
Where are these rumors floating around?
One of many https://x.com/synthwavedd/status/2021983382314660075
Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just shy of a solution anyway.
> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
They will never run it on the private set, because that would mean the set being leaked to Google.