>4.1 Was better in 55% of cases
Um, isn't that just a fancy way of saying it's slightly better?
>Score of 6.81 against 6.66
So very slightly better
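For scale, the relative gap between those two judge scores works out to a little over 2% (which is where the "2% better" figure elsewhere in this thread comes from):

```python
# Relative difference between the two judge scores quoted above
gap = (6.81 / 6.66 - 1) * 100
print(f"{gap:.1f}%")  # → 2.3%
```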
"they found that GPT‑4.1 excels at both precision..."
They didn't say it is better than Claude at precision etc. Just that it excels.
Unfortunately, AI has still not concluded that manipulations by the marketing dept are a plague...
A great way to upsell 2% better! I should start doing that.
Good marketing if you're selling a discount all-purpose cleaner, not so much for an API.
I don't think the absolute score means much — judge models have a tendency to score around 7/10 lol
55% vs. 45% equates to about a 35-point difference in Elo. In chess that would be two players in the same league, but one with a clear edge.
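For anyone who wants to check that conversion: it's just the Elo expected-score formula inverted. A quick sketch (the exact figure lands closer to 35 points, same ballpark either way):

```python
import math

def elo_gap(win_rate: float) -> float:
    """Elo rating difference implied by an expected score (win rate),
    inverting E = 1 / (1 + 10**(-d/400))."""
    return -400 * math.log10(1 / win_rate - 1)

print(round(elo_gap(0.55)))  # → 35
```

A 50% win rate maps to a 0-point gap, as you'd expect.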
Rarely are two models put head-to-head though. If Claude Sonnet 3.7 isn't able to generate a good PR review (for whatever reason), a 2% better review isn't all that strong of a value proposition.
the point is OpenAI is saying they have a viable Claude Sonnet competitor now