Here it is on the latest Opus release 11 days ago. It’s the 5th highest voted comment on the post, and the most critical comment is “should you at least try like 10 times or something to average the random effects”:

https://news.ycombinator.com/item?id=48311979

Gemini Flash release 19 days ago, again no criticism:

https://news.ycombinator.com/item?id=48198232

Interesting that Simon declared the pelican dead when Qwen 27B overtook Opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that Opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that the new Opus is a very sparse MoE with close to 27B active params.

  "there has been a direct correlation between the quality of the pelicans produced and the general usefulness of the models ...
 
  Today, even that loose connection to utility has been broken..." 
https://simonwillison.net/2026/Apr/16/qwen-beats-opus/