... according to grok-4-1-fast-non-reasoning, which was the judge. On 4 tasks in total, the score was 38 to 33, so obviously huge conclusions can be made.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
grok-4-1-fast was retired about a month ago.
Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".
https://docs.x.ai/developers/migration/may-15-retirement
TFA was published today, which implies grok-4.3 was used.
Which specific model was used as the judge is like the least of the issues with their methodology.
Pretty small sample size here, but it's hard to avoid the conclusion that DeepSeek and friends will start to put some serious downward pressure on frontier lab token pricing.
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
It seems the frontier labs, on balance, would rather lose that segment of the market than lower the API price. They are getting the bag in the enterprise segment, and those clients aren't ditching them for DeepSeek.
As for other segments, high API pricing gets people to switch to subscriptions instead, which are stickier than the API.
I've been hearing that Anthropic wants all major AI providers to stop developing frontier models for a year for safety reasons. The real reason is they need time to get their models cheaper because of the threat from DeepSeek, local LLMs, or other even cheaper providers.
Seems like a ridiculous request - how can they ensure China will stop developing frontier models?
The OP uses tons of typical AI turns of phrase, and Pangram classified it as AI with high confidence.
So it doesn't surprise me at all that the methodology is weak, too.