Hacker News

Does a bit worse than Opus 4.8 in my tests[0], but it's 5x cheaper and 3x slower.

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-8-mediu...

Note that since the model is open-weights, "slower" is relative: it depends on who's serving the model. This can change drastically over time too.

nsoonhui 18 hours ago

Not sure what to make of your benchmark, because GPT 5.5 (low) ranks higher than GPT 5.5 (medium) -- #4 vs #9.

XCSme 18 hours ago

You'd be surprised: some models on high do worse than on medium, because they start overthinking and doubting themselves, polluting the context with too much information, etc.

It depends a lot on the task and harness too (using plans and to-do lists vs. one-shot answers), but for simply answering an inquiry directly, extra thinking often doesn't improve the answer, especially if the answer is binary (either correct or wrong), as opposed to a creative output that benefits from more time to refine.

XCSme 18 hours ago

Another example was Gemini 3.1 Flash Lite, which on high was basically just burning tokens, costing like 30x more while giving worse answers:

https://aibenchy.com/compare/google-gemini-3-1-flash-lite-hi...