Hacker News

> poorly constructed arbitrary experiments which say very little about the competency of either model.

No one ever says this about the “pelican on a bicycle” metric

Actually, simonw has started saying that after qwen 27B beat Opus 4.7

https://news.ycombinator.com/item?id=48446348

I'd guess someone does say it, but the comment gets downvoted or similar. Simon has a bit of a cult of personality on HN, for better or worse.

mrngld 7 hours ago

I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.

redsocksfan45 7 hours ago

Simon's pelican is in fact routinely criticised for exactly that.

an0malous 7 hours ago

Here it is on the latest Opus release 11 days ago. It’s the 5th highest voted comment on the post, and the most critical reply is “should you at least try like 10 times or something to average the random effects”:

https://news.ycombinator.com/item?id=48311979

Gemini Flash release 19 days ago, again no criticism:

https://news.ycombinator.com/item?id=48198232
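
For what it's worth, the "try like 10 times" suggestion quoted above is cheap to do. Here is a rough sketch of averaging repeated runs and reporting the spread. `fake_score` is a hypothetical stand-in for "generate one pelican and rate it 0 to 10"; it is not anything Simon actually uses, and the numbers in it are made up.

```python
import random
import statistics


def summarize_runs(scores):
    """Return (mean, standard error of the mean) for per-run scores."""
    mean = statistics.mean(scores)
    # Standard error = sample standard deviation / sqrt(n); needs n >= 2.
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


def fake_score(rng):
    # Hypothetical placeholder: a real version would prompt the model,
    # render the SVG, and have a person or rubric assign a 0-10 rating.
    return min(10.0, max(0.0, rng.gauss(6.0, 1.5)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
scores = [fake_score(rng) for _ in range(10)]
mean, sem = summarize_runs(scores)
print(f"mean={mean:.2f} +/- {sem:.2f} over {len(scores)} runs")
```

Two models whose means differ by less than a couple of standard errors probably can't be ranked from a handful of samples.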

irthomasthomas 6 hours ago

Interesting that Simon declared the pelican dead when qwen 27B overtook opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that opus must be much larger. But I suspect that active parameters matter more than total parameters, and it is possible that the new opus is a very sparse MoE with close to 27B active params.

  "there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
 
  Today, even that loose connection to utility has been broken..."

https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
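
To make the active-vs-total point concrete, here is a back-of-the-envelope sketch. Every number in it is hypothetical and chosen only so the arithmetic lands on 27B active; nothing here is a known figure for Opus or Qwen. It also simplifies by treating each expert as one lump of parameters summed across all layers.

```python
def moe_param_counts(shared, per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE.

    shared     : parameters every token uses (attention, embeddings, ...)
    per_expert : parameters in one expert, summed over all layers
    n_experts  : experts available to the router
    top_k      : experts actually used per token
    """
    total = shared + n_experts * per_expert
    active = shared + top_k * per_expert
    return total, active


# Hypothetical configuration, not a real model.
B = 1_000_000_000
total, active = moe_param_counts(
    shared=3 * B, per_expert=3 * B, n_experts=64, top_k=8
)
print(f"total={total / B:.0f}B active={active / B:.0f}B")
```

So a model can be several times larger on disk than a dense 27B model while doing a similar amount of compute per token.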