Hacker News

Here are the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

The ARC-AGI-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

gs17 2 months ago

Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.

sigmar 2 months ago

I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.

edit: they just removed the reference to "3.1" from the pdf

josalhor 2 months ago

I think this is 3.1 (3.0 Pro with the RL improvements of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.

WarmWash 2 months ago

The Deep Think moniker is for parallel compute models though, not long CoT models like the Pro ones.

It's possible though that Deep Think 3 is running 3.1 models under the hood.

staticman2 2 months ago

That's odd considering 3.0 is still labeled a "preview" release.

ainch 2 months ago

I think it'll be 3.1 by the time it's labelled GA. They said after the 3.0 launch that they had figured out new RL methods for Flash that the Pro model hasn't benefitted from yet.

WarmWash 2 months ago

The rumor was that 3.1 was today's drop

losvedir 2 months ago

Where are these rumors floating around?

beauzero 2 months ago

One of many https://x.com/synthwavedd/status/2021983382314660075

thadk 2 months ago

Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just shy of a solution anyway.

riku_iki 2 months ago

> If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"

They will never run it on the private set, because that would mean the set being leaked to Google.