> poorly constructed arbitrary experiments which say very little about the competency of either model.
No one ever says this about the “pelican on a bicycle” metric
Actually, simonw has started saying that after Qwen 27B beat Opus 4.7:
https://news.ycombinator.com/item?id=48446348
I am willing to guess people do say it, but it gets downvoted or similar. Simon is a bit of a cult of personality on HN, for better or worse.
I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.
Simon's pelican is in fact routinely criticised for exactly that.
Here it is on the latest Opus release 11 days ago. It’s the 5th-highest-voted comment on the post, and the most critical reply is “should you at least try like 10 times or something to average the random effects”:
https://news.ycombinator.com/item?id=48311979
Gemini Flash release 19 days ago, again no criticism:
https://news.ycombinator.com/item?id=48198232
Interesting that Simon declared the pelican dead when Qwen 27B overtook Opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that Opus must be much larger. But I suspect that active parameters matter more than total parameters, and it is possible that the new Opus is a very sparse MoE with close to 27B active params.
https://simonwillison.net/2026/Apr/16/qwen-beats-opus/