The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...
Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.
Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
That pelican looks like it's in Miami for a crypto conference.
That pelican wears its sunglasses at night. So it can, so it can keep track of the visions in its eyes.
It looks quite funny.
Pelican and I need an optometrist urgently
It looks like the starting soon screen of a crypto presentation
It looks like it’s been partying for 60 years based on the wrinkles on its pouch.
That pelican looks like it lost 100k on NFTs and now runs a paid stock-trading group.
Pelican in a white Testarossa.
They're called ClawCons now
Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.
Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.
It looks like the start of a new viral Peliwave aesthetic
and somehow in 1992
sorta looks like the Tron ripoff in the I/O keynote
This is a perfect illustration of something I've noticed with LLM progress. Ask a model to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs; it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.
edit: fixed human hallucination
When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?
I ask because:
Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.
But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat).
I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
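For what it's worth, the render / look / edit / repeat loop described above can be sketched in a few lines. This is only an illustration, not anyone's actual harness: `refine_svg`, `render`, `critique`, and `revise` are all hypothetical names, and the three callables stand in for whatever rasterizer and model calls you would plug in.

```python
from typing import Callable


def refine_svg(
    svg: str,
    render: Callable[[str], bytes],
    critique: Callable[[bytes], list[str]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Render the SVG, list flaws seen in the rendering, edit, and repeat.

    All three callables are assumptions supplied by the caller:
      render   -- rasterizes SVG markup to image bytes (e.g. via a browser)
      critique -- returns the flaws visible in the rendered image
      revise   -- returns new SVG markup that tries to fix those flaws
    """
    for _ in range(max_rounds):
        image = render(svg)      # judge the picture, not the markup
        flaws = critique(image)
        if not flaws:            # nothing left to fix, stop early
            break
        svg = revise(svg, flaws)
    return svg
```

The point of the design is that `critique` only ever sees the rendered image, which is the part the plain "feed the SVG back in" setup skips.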
I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, so the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.
And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.
To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.
When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.
What is “Sonnet 3.7 moment”?
Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.
So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).
This matches my experience with humans too FWIW.
Why is there always an identical reply like this when anyone criticizes LLMs?
It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.
Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.
Forgetting the chainstay is typical of asking random people to draw a bicycle.
https://www.gianlucagimini.it/portfolio-item/velocipedia/
> most ended up drawing something that was pretty far off from a regular men’s bicycle
Asking random people to write SVG gives even worse results
Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)
One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.
But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.
Thanks for the delightful Velocipedia
Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.
Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.
Wouldn't be a thread about the tech that is changing the landscape for businesses across nearly every discipline without a pelican svg.
I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.
That's grok. IMO both gemini and grok are the most overlooked models.
If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O
The fact it went for vaporwave styling on its own is very telling.
We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.
That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.
I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".
I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.
I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).
Same old issue with Gemini models trying to "enrich" everything
I can’t help but think that what AI is best at is convincing management that the things it creates are full-featured, which reads to their brains as mature.
I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?
https://en.wikipedia.org/wiki/Vaporwave
That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009
Wow, what’s with all the styling? Is it a manifestation of Google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.
Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?
Well clearly it's not working lmao
`<!-- Pelican Eye / Sunglasses (Cool Retro Aviators) -->`
wtf
`<!-- Gold Rim -->`
WTF??
They are just trolling you now
I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test
Last time I tried, ChatGPT's image generator got the best result.
funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.
That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.
This question makes me wonder if you one shot each pelican or do you run it a few times to get the best one?
I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.
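A minimal sketch of that "generate 3x, then let the model pick" idea, purely as an assumption about how it could be wired up. `best_of_n`, `generate`, and `pick_best` are hypothetical names; the callables would wrap real API calls, which are not shown here.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    pick_best: Callable[[Sequence[str]], int],
    n: int = 3,
) -> str:
    """Generate n candidates for one prompt and return the judged-best one.

    generate  -- hypothetical: one model call returning SVG markup
    pick_best -- hypothetical: a (vision) model call that returns the
                 index of the candidate it prefers
    """
    candidates = [generate(prompt) for _ in range(n)]
    index = pick_best(candidates)
    if not 0 <= index < len(candidates):
        raise ValueError(f"judge returned out-of-range index {index}")
    return candidates[index]
```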
Beats a human by like $10
So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)
Eps?
Love your pelicans, as always. And that one is... Wow.
I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now catching on in AI too.
https://en.wikipedia.org/wiki/Synthwave
Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.
To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.
Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.
So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.
"Look around to look around."
At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.
at a certain point you're gonna need to change your benchmark because this will end up in the model's training set
Gemini were the team most likely to have this in their training set - see https://x.com/JeffDean/status/2024525132266688757 - and yet their latest model still messes up the bicycle frame!
I'm sure that certain point came and went many releases ago.