Blog post is up: https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 Pro:

ARC-AGI-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%

Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.

Theoretically, you can't benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

https://arcprize.org/arc-agi/1/

It's a sort of arbitrary pattern-matching thing that can't be trained on in the sense that the MMLU can be, but you can definitely generate billions of examples of this kind of task and train on them, and it will not make the model better at any other task. So in that sense, it absolutely can be benchmaxxed (see the sketch below).
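To make that concrete, here's a minimal sketch of what a synthetic ARC-style task generator could look like. The transformation set and grid sizes here are made up for illustration; this is not anyone's actual training pipeline:

    import random

    # Hypothetical ARC-style task generator: each task is a few
    # (input, output) grid pairs sharing one hidden transformation.
    # Grids are lists of lists of ints 0-9 (ARC's ten "colors").

    TRANSFORMS = {
        "flip_h": lambda g: [row[::-1] for row in g],
        "flip_v": lambda g: g[::-1],
        "rot180": lambda g: [row[::-1] for row in g[::-1]],
        "swap_1_2": lambda g: [[{1: 2, 2: 1}.get(c, c) for c in row] for row in g],
    }

    def random_grid(h, w):
        return [[random.randint(0, 9) for _ in range(w)] for _ in range(h)]

    def make_task(n_pairs=3):
        # Pick one hidden rule and demonstrate it across several pairs.
        rule, fn = random.choice(list(TRANSFORMS.items()))
        pairs = []
        for _ in range(n_pairs):
            g = random_grid(random.randint(3, 6), random.randint(3, 6))
            pairs.append({"input": g, "output": fn(g)})
        return {"rule": rule, "pairs": pairs}

    print(make_task())  # mint as many as you like

The catch is exactly what the parent says: a model trained on a billion of these mostly learns that family of transformations, not reasoning in general.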

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work: https://arxiv.org/html/2407.06581v1

Couldn't benchmark-maxing be interpreted as benchmarks actually serving as a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

He's always said ARC is a necessary but not sufficient condition for intelligence, afaik.

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

The fact that ARC-AGI has public and semi-private datasets in addition to a private one might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

I assume all the frontier models are benchmaxxing, so it would make sense

The touted SVG improvements make me excited for animated pelicans.

I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but it's a significant leap in complexity.

Good to see it wearing a helmet. Their safety team must be on their game.

Yes, but why would a pelican need a helmet? If it falls over, it can just fly away... Common sense 1, Gemini 0.

Obviously these domestic pelicans can't fly, otherwise why would they need a bike?

Why would a pelican be riding a bicycle at all, for that matter?

Because the user asked for it

That's a good pelican. What I like most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!

Looks great!

Here's what I got from Gemini Pro on gemini.google.com; it thought for under a minute... might you have been using AI Studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown in the message-sending component.

The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...

I imagine they're also benchgooning on SVG generation

SVG is an underrated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.); see the snippet below.
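For instance, here's a minimal hand-written sketch (my own, not model output): a <style> block can live inside the SVG itself, so the shape recolors and scales on hover with no JavaScript and no external stylesheet:

    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" width="200">
      <style>
        /* CSS ships inside the file, so the graphic is interactive on its own */
        .blob { fill: #4a90d9; transition: fill 0.3s, transform 0.3s;
                transform-box: fill-box; transform-origin: center; }
        .blob:hover { fill: #e8553a; transform: scale(1.2); }
      </style>
      <circle class="blob" cx="50" cy="50" r="30"/>
    </svg>

Open it directly in a browser (or inline it in an HTML page) and hover over the circle.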

My perennial joke is that as soon as that got on the HN front page, Google went and hired some interns who now spend 100% of their time on pelicans.

How about STL files for 3D-printing pelicans!