Does a bit worse than Opus 4.8 in my tests[0], but it's 5x cheaper and 3x slower.
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...
Note that being open-weights, "slower" is relative, as it depends on who's serving the model. This can drastically change over time too.
Not sure what to make of your benchmark, because GPT 5.5 (low) ranks higher than GPT 5.5 (medium) -- #4 vs #9
You'd be surprised, some models on high do worse than on medium, because they start overthinking and doubting themselves, polluting the context with too much information, etc.
It depends a lot on the task and harness too (plans and to-do lists vs. one-shot answers), but when simply answering a question directly, extra thinking often doesn't improve the answer, especially when the answer is binary (either correct or wrong), as opposed to a creative output that benefits from more time to refine.
Another example was Gemini 3.1 Flash Lite, which on high was basically just burning tokens, costing about 30x more while giving worse answers:
https://aibenchy.com/compare/google-gemini-3-1-flash-lite-hi...