It’s four poorly constructed arbitrary experiments which say very little about the competency of either model.
The article reads like thin, auto-generated AI clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“Where it matters”, “cleanly”, “is still strong”: vague phrases instead of simply telling the reader that in 3 out of 4 tests DeepSeek yielded more concise results.
1 star.
I think you've misunderstood the purpose of a lead (sic).
Per Merriam-Webster [^1], a lede is:
> the introductory section of a news story that is intended to entice the reader to read the full story
(Emphasis mine)
You may prefer more matter-of-fact phrasing, of course, but criticising a lede for attempting to achieve its goal is unjustified.
[^1]: https://www.merriam-webster.com/dictionary/lede
A 'lede' is just an intentionally differentiated spelling of 'lead'; the origin of the word is just 'lead'. Collins dictionary defines lede as "a variant spelling of lead".
Is it not an intentional spelling in order to coin journalistic jargon?
TIL, thank you.
I think the criticism is less about whether the lede is good at achieving its goal and more about whether that goal is honorable in the first place.
So dismissing it on technicalities is for sure clever but also obvious and lame.
The letter/spirit thing eventually got boring. Please find better material.
I apologise if using words correctly is obvious and lame.
GP is explicitly criticising the language in the lede as being unsuitably vague, hence my reply.
As to the goal of the article, I fail to see what is dishonourable about comparing LLMs. You may consider the methodology flawed, but it's a perfectly respectable goal.
Sorry, was that another technicality? I'll try to find better material, just for you.
There are monied interests that do not want inexpensive Chinese successors to Scam Altman's creation.
They're inexpensive because they're derived from his creation.
The creation--which isn't "his" in the first place, by any standard definition--was not only itself "derived from" our creations but was always supposed to be "open".
> which isn't "his" in the first place, by any standard definition
I was saying that because of the previous comment:
> to Scam Altman's creation
It wasn't derived in the same way though - I can read loads of books and so can write my own book, but that's not derivation in the same way as DeepSeek's derivation.
It’s the hardest part of an article if you ask me.
Filling it with slop constructs signals to the reader that no effort was made in writing the article. So no effort should be put into reading it.
The rest of the article is equally flimsy. Great clickbait title, perhaps that is even harder than writing a lede.
I am not a native speaker :)
I agree, I'd rather not see AI-generated articles about AI on HN unless they're really good.
(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction-following benchmarks (although DeepSeek V4 Pro does not top these): https://artificialanalysis.ai/evaluations/ifbench
I found the writing clear and quite even-handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM-generated are quite low-effort.
It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems) and then completely ignoring quite a few cases where GPT-Pro actually got some things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.
In the car business there are only one or two models that are the ideal choice, but many subpar companies and models still sell, for many reasons.
It shows DeepSeek is competitive with, and sometimes better than, GPT-5.5. It also shows there is no moat. As such it is a highly significant signal.
I agree that there may be a lot of variation between models that leads to different use cases, at least today. But I’m not sure the car analogy works.
An X5 is not simply “inferior” to a CR-V, or vice versa. A Camry is not “inferior” to an F-150, or vice versa. They are optimized for different buyers, budgets, constraints, and use cases.
That may actually be the better analogy for AI models: there probably is not one universal “best” model. There are models that are better or worse for particular tasks, price points, latency requirements, deployment constraints, privacy needs, etc.
It's worse than that. It's more like being able to buy an X5 for $5 and produce them for $1000, skipping everything that made making an X5 hard.
> poorly constructed arbitrary experiments which say very little about the competency of either model.
No one ever says this about the “pelican on a bicycle” metric
Actually, simonw has started saying that after qwen 27B beat Opus 4.7
https://news.ycombinator.com/item?id=48446348
I am willing to guess it is said, but gets downvoted or similar. Simon is a bit of a cult of personality on HN for better or worse.
I have his blog in my RSS app and I click every pelican test because it's fun. I think criticizing it for lack of scientific or technical rigor kind of misses its point. It's a fun curiosity.
Simon's pelican is in fact routinely criticised for exactly that.
Here it is on the latest Opus release 11 days ago, it’s the 5th highest voted comment on the post and the most critical comment is “should you at least try like 10 times or something to average the random effects”:
https://news.ycombinator.com/item?id=48311979
Gemini Flash release 19 days ago, again no criticism:
https://news.ycombinator.com/item?id=48198232
Interesting that Simon declared the pelican dead when Qwen 27B overtook Opus 4.7. That seems a strange criterion for deciding the utility of a benchmark, without more proof. I think it stems from the assumption that Opus must be much larger. But I suspect that active parameters are more important than total parameters, and it is possible that the new Opus is a very sparse MoE with close to 27B active params.
https://simonwillison.net/2026/Apr/16/qwen-beats-opus/