Hacker News

simonw 10 hours ago [ - ]

The pelican is a lot: https://github.com/simonw/llm-gemini/issues/133#issuecomment...

Not a great bicycle though, it forgot the bar between the pedals and the back wheel and weirdly tangled the other bars.

Expensive too - that pelican cost 13 cents: https://www.llm-prices.com/#it=11&ot=14403&sel=gemini-3.5-fl...
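
For a rough sense of what that figure implies, here is a minimal sketch of the arithmetic. It assumes the `it=11` and `ot=14403` parameters in the link are input and output token counts, and it takes the 13 cents at face value; the function names are made up for illustration and no actual price list is used.

```python
def implied_price_per_million(cost_usd: float, total_tokens: int) -> float:
    """Blended USD price per one million tokens implied by a total cost."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return cost_usd / total_tokens * 1_000_000


def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_million: float,
             output_price_per_million: float) -> float:
    """Total cost in USD given separate per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Figures quoted in the comment: about 13 cents in total.
# ASSUMPTION: it=11 and ot=14403 in the URL are input/output token counts.
INPUT_TOKENS = 11
OUTPUT_TOKENS = 14403
REPORTED_COST_USD = 0.13

blended = implied_price_per_million(REPORTED_COST_USD,
                                    INPUT_TOKENS + OUTPUT_TOKENS)
print(f"implied blended price: ${blended:.2f} per million tokens")
```

Since the 13 cents is a rounded figure, this only pins the rate down to roughly $9 per million tokens, nearly all of it spent on output.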

hedgehog 10 hours ago [ - ]

That pelican looks like it's in Miami for a crypto conference.

seemaze 5 hours ago [ - ]

That pelican wears its sunglasses at night. So it can, so it can keep track of the visions in its eyes.

baochillchill 3 hours ago [ - ]

It looks quite funny.

whh 5 hours ago [ - ]

Pelican and I need an optometrist urgently

joseda-hg 10 hours ago [ - ]

It looks like the starting soon screen of a crypto presentation

xattt 10 hours ago [ - ]

It looks like it’s been partying for 60 years based on the wrinkles on its pouch.

coffeecoders 6 hours ago [ - ]

That pelican looks like it lost 100k on NFTs and now runs a paid stock-trading group.

Xenoamorphous 9 hours ago [ - ]

Pelican in a white Testarossa.

airstrike 6 hours ago [ - ]

They're called ClawCons now

sho_hn 5 hours ago [ - ]

Personally, I don't attend them since I figured out I can set up agents to performatively engage in AI-related discussion and events for me, freeing up tons of my time thanks to automation.

Truly: Nothing better than AI tools to brave the challenges and requirements of modern life. "Claude, ride the hype train" is the decisive prompt you need.

brindleth 7 hours ago [ - ]

It looks like the start of a new viral Peliwave aesthetic

egillie 9 hours ago [ - ]

and somehow in 1992

verdverm 9 hours ago [ - ]

sorta looks like the Tron ripoff in the I/O keynote

irthomasthomas 10 hours ago [ - ]

This is a perfect illustration of something I've noticed about LLM progress. Ask a model to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs; it just adds more stuff. In this example the models have obviously improved greatly, and the image contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.

edit: fixed human hallucination

derefr 9 hours ago [ - ]

When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
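
The render, look, fix, repeat loop described above could be sketched roughly like this. It is only an outline under stated assumptions: `render`, `critique` and `revise` are hypothetical callables the caller supplies (for example an SVG rasterizer, a vision-model call, and an editing-model call); none of them is a real library API.

```python
from typing import Callable, List


def refine_svg(svg: str,
               render: Callable[[str], bytes],
               critique: Callable[[bytes], List[str]],
               revise: Callable[[str, List[str]], str],
               max_rounds: int = 3) -> str:
    """Iteratively improve an SVG by inspecting its rendering.

    render   -- hypothetical: turns SVG source into a rasterized image
    critique -- hypothetical: lists flaws visible in the rendered image
    revise   -- hypothetical: edits the SVG source to address the flaws
    """
    for _ in range(max_rounds):
        image = render(svg)        # look at the picture, not the markup
        flaws = critique(image)    # note what is wrong with the rendering
        if not flaws:
            break                  # nothing left to fix
        svg = revise(svg, flaws)   # edit the source, then look again
    return svg
```

The point of the design is that the critique step only ever sees the rendered image, which mirrors how a human would work on the file.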

irthomasthomas 9 hours ago [ - ]

I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, so the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and a fish in the bird's mouth.

stared 7 hours ago [ - ]

To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

p1esk 5 hours ago [ - ]

What is “Sonnet 3.7 moment”?

stirfish 4 hours ago [ - ]

Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

Araopa 4 hours ago [ - ]

So we have to train LLMs on debugging too, not just on how to make things (which you can easily train by feeding in the outputs).

sosborn 5 hours ago [ - ]

This matches my experience with humans too FWIW.

emp17344 5 hours ago [ - ]

Why is there always an identical reply like this when anyone criticizes LLMs?

gowld 5 hours ago [ - ]

It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

girvo 7 hours ago [ - ]

Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

tantalor 10 hours ago [ - ]

Forgetting the chainstay is typical of asking random people to draw a bicycle.

https://www.gianlucagimini.it/portfolio-item/velocipedia/

> most ended up drawing something that was pretty far off from a regular men’s bicycle

et1337 9 hours ago [ - ]

Asking random people to write SVG gives even worse results

lxgr 8 hours ago [ - ]

Especially without being able to look at the rendered output! (At least I'd be surprised if modern server-side tool calls regularly include an SVG renderer that can show a rasterized version to the model to iterate on it.)

gpm 4 hours ago [ - ]

One of the many things Google was pitching today is that they're going to run things like google search with access to linux container environments to do things like run tool calls... which will presumably be able to rasterize SVGs and show them to the model.

But Simon says he runs these through the API without tool access specifically to prevent that sort of "cheating". I.e. it's an LLM benchmark not an LLM+Harness benchmark.

Barbing 26 minutes ago [ - ]

Thanks for the delightful Velocipedia

Eji1700 7 hours ago [ - ]

Although every single render of those has pedals on the correct side, as opposed to the Gemini optical-illusion back pedal that tries to be both on the other side of the central gear and in front of the back wheel.

Not really a criticism but an interesting point that you would never expect a human to make that mistake even in a bad drawing.

dankwizard 38 minutes ago [ - ]

Wouldn't be a thread about the tech that is changing the landscape for businesses across nearly every discipline without a pelican svg.

smcleod 10 hours ago [ - ]

I feel like it embodies Google's vibe of an uncool guy trying to stay relevant to the youth pretty well.

dzhiurgis 4 hours ago [ - ]

That's grok. IMO both gemini and grok are the most overlooked models.

tandr 5 hours ago [ - ]

If you sort that table by "output token price", it gets really terrifying - going from 4 cents up to $600 =8-O

VectorLock an hour ago [ - ]

The fact it went for vaporwave styling on its own is very telling.

nrds 5 hours ago [ - ]

We've been daily-driving this model for a few weeks and let me tell you, everything it does is a lot. Fast as fuck and it's actually not bad intelligence-wise for a fast model. It basically tries to make up for any intelligence deficit by just doing a lot, checking a lot, retrying a lot.

That's not to say I don't spend my days raging at it... a lot... but it's not that bad. It does tend to ignore completion criteria but it doesn't obviously degrade when being nudged like some models do.

dekhn 5 hours ago [ - ]

I'm told there is a new Jeff Dean fact inside google: "Jeff Dean manually adjusts the weights in the model just to screw with Simon".

karmakaze 5 hours ago [ - ]

I'm hoping we'll have many of these pelican cyclist pictures collected. Then when all the models can do it well, we'll stop posting about them, and when the next generations of AIs train on the data we'll have these canonical archetypes.

bee_rider 3 hours ago [ - ]

I wonder if they added all these unrequested details as an Easter-egg or something? (Since they must be aware of your test by now).

hydra-f 10 hours ago [ - ]

Same old issue with Gemini models trying to "enrich" everything

taurath 5 hours ago [ - ]

I can’t help but think that what AI is best at is convincing management that the things it creates are full-featured, which reads to their brains as mature.

nickvec 8 hours ago [ - ]

I enjoy the vaporwave aesthetic it went for. Looks like the pelican has a fish in its mouth too?

https://en.wikipedia.org/wiki/Vaporwave

khy 9 hours ago [ - ]

That sun is very similar to the one from the background of this other top HN post about the OS museum: https://news.ycombinator.com/item?id=48195009

sbinnee 7 hours ago [ - ]

Wow, what’s with all the styling? Is it a manifestation of Google’s styling bias? I like the result for sure. It’s shiny and pretty. But then it’s something I didn’t ask for.

danilocesar 5 hours ago [ - ]

Given your pelican is very famous now, don't you think they are adding instructions to beat this benchmark these days?

Culonavirus 5 hours ago [ - ]

Well clearly it's not working lmao

setgree 9 hours ago [ - ]

``

wtf

``

WTF??

__mharrison__ 8 hours ago [ - ]

They are just trolling you now

Razengan 4 hours ago [ - ]

I've found prompts like "capybara with spotted fur and 7 octopus tentacles instead of legs, each a different color, riding a tricycle" etc. to be a better test

Last time I tried, ChatGPT's image generator got the best result.

gcgbarbosa 10 hours ago [ - ]

funny that when I try the same prompt, gemini generates an image, not an SVG. something is not right.

simonw 10 hours ago [ - ]

That's likely because you're using the Gemini app which has a tool for image generation (nano banana) - I do my tests against the API to avoid any possibility of tool use.

nickmccann 9 hours ago [ - ]

This question makes me wonder if you one shot each pelican or do you run it a few times to get the best one?

simonw 7 hours ago [ - ]

I one-shot. I have a long-standing ambition to have each model generate 3x and then get the model (assuming it's a vision model) to pick the best one.
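
As a rough illustration of that "generate 3x, then pick" idea, here is a minimal sketch. Both `generate` and `pick_best` are hypothetical callables standing in for whatever model calls would actually be used; nothing here reflects how the tests are run today.

```python
from typing import Callable, List


def best_of_n(generate: Callable[[], str],
              pick_best: Callable[[List[str]], int],
              n: int = 3) -> str:
    """Generate n candidates and return the one the judge selects.

    generate  -- hypothetical: one fresh model attempt, returning SVG source
    pick_best -- hypothetical: a judge (e.g. the same model with vision)
                 that returns the index of the preferred candidate
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    index = pick_best(candidates)
    if not 0 <= index < len(candidates):
        raise ValueError(f"judge returned invalid index {index}")
    return candidates[index]
```

Note that this changes what is being measured: the result reflects the model's judging ability as well as its drawing ability.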

nashashmi 10 hours ago [ - ]

Beats a human by like $10

unglaublich 10 hours ago [ - ]

So according to Google logic, the value of the pelican is $10-eps. (They applied that reasoning to conversions via adwords)

Barbing 23 minutes ago [ - ]

Eps?

TacticalCoder 7 hours ago [ - ]

Love your pelicans, as always. And that one is... Wow.

I noticed the "Synthwave" aesthetic, which has been enjoying quite some success for quite some time now, has found its way into AI models (even when it's not in the user's query). It's not the first time I've seen the sun at sunset with color bands etc. in AI-generated pictures. Don't know why it's now taking off in AI too.

https://en.wikipedia.org/wiki/Synthwave

Hence the comments here about the 90s, Sonny Crockett's white Ferrari Testarossa in Miami, etc.

To be honest as a kid from the 80s and a teenager from the 90s who grew up with that aesthetic in posters, on VHS tape covers, magazine covers, etc. I do love that style and I love that it made a comeback and that that comeback somehow stayed.

kridsdale3 6 hours ago [ - ]

Synthwave vibe hype hit a cultural high point with the release of Far Cry 3 Blood Dragon in 2013.

So it's as relevant and baked-in to today as actual 80s synth-culture was in 2000.

professoretc an hour ago [ - ]

"Look around to look around."

gowld 5 hours ago [ - ]

At the keynote today, Sundar Pichai asked Gemini to clone the Dino Game, and it added a synthwave-esque aesthetic.

holtkam2 10 hours ago [ - ]

at a certain point you're gonna need to change your benchmark because this will end up in the model's training set

simonw 10 hours ago [ - ]

Gemini were the team most likely to have this in their training set - see https://x.com/JeffDean/status/2024525132266688757 - and yet their latest model still messes up the bicycle frame!

recursive 8 hours ago [ - ]

I'm sure that certain point came and went many releases ago.
