Testing against unspecified other "leading" models allows for shenanigans:

> Qodo tested GPT‑4.1 head-to-head against other leading models [...] they found that GPT‑4.1 produced the better suggestion in 55% of cases

The linked blog post returns a 404: https://www.qodo.ai/blog/benchmarked-gpt-4-1/

The post seems to be up now and compares GPT‑4.1 slightly favorably to Claude 3.7.

Right, it's up now, and the comparison against Claude 3.7 is better than I feared based on the wording. Though why does the OpenAI announcement speak of comparisons against multiple leading models when the Qodo blog post only tests against Claude 3.7...