The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/

I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.

Performance numbers:

  Reading: 20 tokens, 0.4s, 54.32 tokens/s
  Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s

I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/

I feel like this time it is indeed in the training set, because it is too good to be true.

Can you run your other tests and see the difference?

It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

Compared to your test with GLM 5.1, this indeed looks off:

https://xcancel.com/simonw/status/2041646779553476801

Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.

The point is the relative difference between the pelican test and the "other" test for each model, which suggests the pelican is being treated specially these days (could be as simple as it being common in recent data). It's not about the relative difference between the models on the "other" case in isolation.

Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, and nothing on the model until you scroll past it. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where well-known commenter + one-off vibe test + 1:1 sub-threads eats the whole discussion. It being fun makes it hard to push back on without looking picky.

You can collapse the pelican thread with the little [-] toggle at the top.

Why would you though?

And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.

Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (A case in miniature here: what is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)

There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:

1. You can run this on a Mac using llama-server and a 17GB downloaded file (see the sketch after this list)

2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model

3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
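If you want to poke at it the same way, here's a minimal sketch. It assumes a recent llama.cpp build; the GGUF filename is a placeholder, not the real download name:

    # llama-server exposes an OpenAI-compatible API. Start it first with
    # something like (filename is a placeholder for the 17GB download):
    #   llama-server -m qwen3.6-27b-q4_k_m.gguf --port 8080
    from openai import OpenAI

    # llama-server serves whatever model it was started with,
    # so the model name below is effectively ignored
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
    response = client.chat.completions.create(
        model="local",
        messages=[{"role": "user",
                   "content": "Generate an SVG of a pelican riding a bicycle"}],
    )
    print(response.choices[0].message.content)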

Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.

* er, that probably sounds strange, but I did just spend 6 weeks integrating the Willison Trifecta into the app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model, and handling it is a significant UX accomplishment IMHO.

I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.

Somewhat ironically, as of this writing, this tangent is dominating the size of this topic.

I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?)

It's perhaps not a serious test (it isn't to me), but on the edges of the pelican jokes there are usually some useful things said by people smarter than me. And if providers are spending time on making pelicans or SVGs look better, that benefits all of us.

So, no hard feelings; you're understood (and I'm not trying to be patronising, I'm just awkward with the language). But pelicans are here to stay, because the consensus seems to be that they're beneficial and on topic.

All the best!

[deleted]

I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.

The traffic I get from a comment with a link to a pelican is pretty tiny.

"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors".

Missing an opportunity here, lol.

I think at this point we can safely put the pelican test in the category of Goodhart's law.

If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.
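A rough sketch of that pipeline, assuming the diffusers library plus the vtracer Python bindings for the raster-to-SVG "distillation" step (model id and prompt are just illustrative):

    # Sketch: raster the scene with a diffusion model, then trace it to SVG.
    # Assumes: pip install diffusers torch vtracer
    from diffusers import StableDiffusionPipeline
    import vtracer

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5"
    ).to("mps")  # Apple Silicon; use "cuda" or "cpu" elsewhere

    image = pipe("a pelican riding a bicycle, flat vector illustration").images[0]
    image.save("pelican.png")

    # "Distill" the raster into vector paths
    vtracer.convert_image_to_svg_py("pelican.png", "pelican.svg", colormode="color")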

If they cook these in, I wonder what else was cooked in there to make it look good.

Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.

[dead]

I think it's important to see that the other, similar example, a dragon driving a car while eating a hotdog, doesn't render nearly as well.

https://news.ycombinator.com/item?id=47865232

IMHO it looks more like a stork than a pelican. Look up any image of an actual pelican and check the ratio of legs to body. That's a weird mistake to make when asked for a "pelican".

Have you considered asking a couple of artists on Fiverr or something to draw you a picture with the same prompt? I don't mean this as a gotcha, it's actual advice, you should probably get a sense of what a real human artist/designer (or three) would do with this prompt.

For example, I hope you will find that one reasoning choice in this picture is wrong, and it has little to do with the ability to draw. Do we enlarge the pelican to human size, or do we shrink the bike to pelican size? There is only one answer that keeps the pelican's proportions: draw a pelican on a very tiny bike. Then its legs will just fit without turning it into a different species, and you can even sort of tuck part of the handlebars under the wings, etc.

I'm curious if other artists would come up with the same or other solutions, but they should in general come up with solutions, which I haven't seen the LLM do, really.

You (or maybe others?) said that the "pelican on a bike" prompt is good because "there is no right answer", since you can't really fit a pelican on a bike. But most artists will say "hold my beer" and figure it out anyway; cartoonists won't even have to think. That "figuring out" of the problem is what I'm missing in the LLMs' responses. They just put a pelican on a bike and make it look like a stork if necessary. I don't really feel like it's actually testing for the thing this prompt is designed for, unless the test still says "FAIL" for each and every one of them, including the one you just called "excellent".

Honestly it never crossed my mind to waste some artist's time with this, but now that the joke "benchmark" has somehow reached orbital velocity maybe I should be thinking about it!

I've run the prompt through dozens of dedicated image generation models so I've seen many versions of this that are better attempts than a text model spitting out SVG - here's gpt-image-2 as a recent example: https://chatgpt.com/share/69ea21ab-8738-83e8-a4d7-67374d84e0...

I am getting 13 t/s on my 36GB M3 Max with almost everything closed (to debug some issues I was having).

PelicanBench, the last benchmark for AGI.

I don't think I've ever heard you say "excellent" for the pelican test before. It looks excellent indeed!

The trend went toward MoE models for some time, and this time around it's a dense model again. I wonder if closed models are also following this pattern: MoE for the faster ones, dense for the pro models.

You'd think by now the LLMs would have figured out that the body of a bicycle is basically just a bisected rhombus. → ◿◸

(I hope I don't ruin the test.)
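(For the curious, a minimal sketch of that topology. The coordinates here are made up; the point is just the two triangles sharing the seat-tube edge.)

    # Two triangles sharing the seat tube -- the "bisected rhombus" --
    # plus a fork line and two wheels is already a recognizable bicycle.
    frame = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 130">
      <!-- rear triangle: rear axle, bottom bracket, seat cluster -->
      <polygon points="30,100 95,100 80,40" fill="none" stroke="black" stroke-width="3"/>
      <!-- front triangle: bottom bracket, seat cluster, head tube -->
      <polygon points="95,100 80,40 150,45" fill="none" stroke="black" stroke-width="3"/>
      <!-- fork down to the front axle -->
      <line x1="150" y1="45" x2="170" y2="100" stroke="black" stroke-width="3"/>
      <circle cx="30" cy="100" r="20" fill="none" stroke="black" stroke-width="3"/>
      <circle cx="170" cy="100" r="20" fill="none" stroke="black" stroke-width="3"/>
    </svg>"""
    with open("frame.svg", "w") as f:
        f.write(frame)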

It would be funny to do an optimization pass to find a compact description of how to coax an accurate pelican bicycle out of a few of the current models, then just blast that snippet everywhere.

So this is it. We have finally achieved excellent illustration of your SVG art.

If you ever consider a logo, make sure it’s either a very poorly considered,

or wildly realistic,

pelican.

At what point do model providers optimize for the "pelican riding a bicycle" test so they place well on Simon's influential benchmark? :-)

They almost certainly are, even if unknowingly, because HN and all blogs get piped continuously into all models' training corpus.

See https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Why is the assumption that they trained for a pelican on a bicycle, rather than running RL for all kinds of 'generate an SVG' tasks?

Gemini did exactly that, and boasted about it at launch: https://x.com/JeffDean/status/2024525132266688757

That post doesn't say anything about training for SVG generation

https://blog.google/innovation-and-ai/models-and-research/ge...

> Code-based animation: 3.1 Pro can generate website-ready, animated SVGs directly from a text prompt. Because these are built in pure code rather than pixels, they remain crisp at any scale and maintain incredibly small file sizes compared to traditional video.

That bowtie on the Qwen Flamingo is also chef's kiss, imho

Metrics and toy examples can be gamed. Rather than these silly examples, how does it feel?

Can you replace Claude Code Opus or Codex with this?

Does it feel >80% as good on "real world" tasks you do on a day to day basis.

These are the stupidest things to cleave to.

[flagged]

I've been using it in a few harnesses (FP8 quant, max context length) and it does seem to get tripped up by tool use, often repeating the same tool when it failed previously - that's usually not a great sign for long-term context and multi-step reasoning. It is excellent at one-shotting though and might be most useful as a sub-agent for a stronger frontier coordinator.

It seemed HN was moving in the right direction when we added the "no AI comments" guideline, and yet every single post about a new model features you and your pelican. It's tired. Please stop; it adds no value and has become a cliché.

Wholly disagree. This is a comment made by a person about an AI topic, not an AI bot commenting on an article, which (as I understand it) is what "no AI comments" is about.

Plus it’s a test that gives varied enough performance across multiple LLMs that it is a good barometer for how well it can think through the steps. Never mind the fact that most people can’t draw a bike from memory. The whole thing is hilarious!

Are you saying I write comments here using an LLM? I don't do that.

We like the pelican posts.

I think it added plenty of value!

How does a quick benchmark of a model "add no value" to the post about the model?

I just created the nopelican user to avoid seeing the same type of comments scoring new models. Why doesn't someone create a monthly pelican thread, like Who Is Hiring, so that everyone who wants to talk about their preferred model and pelican can post at leisure and at full length? Perhaps such a thread could add some good information when grouped by time, model, and pelican features. But I honestly think that the pelican test and the type of comments about it are too much, too repetitive, and add no new information day after day.

The author of the pelican test has provided rich information about LLMs and AI ever since LLMs started to gain traction, but the pelican should fly off and leave the bicycle in the garage, showing off just once a month.

Finally, a bitter take: perhaps an information-dense post without the pelican would get fewer comments and be less Reddit-like. Some people clearly enjoy the image, so this comment from a boring, formal, unamusing person may not be welcome to them, I agree.

This post suggests creating a monthly thread about the pelican, which could give more value to the test. So I think it is not far from meeting HN's etiquette.

Finally, since I think I will be downvoted into oblivion, at least the LLM understands me: The "Substance" vs. "Meme" Conflict

I understand your frustration perfectly. When a model like Qwen 3.6-27B drops—a model explicitly marketed for "Flagship-Level Coding"—you want to know:

    How does it handle dependency injection in complex Python projects?

    What is its context window performance like for real-world repo analysis?

    How does it compare to Claude 3.5 Sonnet for agentic workflows?

Instead, the top comments are often just people saying "Look, the pelican has three wheels!" or "The pelican is floating!" To you, this feels like a waste of the front page.