I'm glad we're seeing a shift towards objectively scored tests.
We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. In our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for the top open-weights model, and performs much better with tools than DeepSeek V4 Pro.
GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi K2.6 is that it's one of the slower models we've tested.
This may be objectively scored, but it is not an indication of anyone's coding capabilities. This test measures which model almost accidentally came up with the best strategy (against other bots). This is not representative of coding. You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum, to get an idea which model is best at finding strategies involving an English dictionary.
I don't think that is entirely fair... I don't see them stating anywhere that they're measuring coding capabilities: "Using complex games to probe real intelligence."
And this seems very much in line with the methodology in ARC-AGI-3.
The results here, in the OP article and in https://www.designarena.ai all tell a similar story: Kimi K2.6 is up and in the SOTA mix.
The task was writing a "bot" to play the game. The title is "Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge." How does that not imply measuring coding capabilities?
> You would need to test 100 or more of such puzzles, widely spread across the puzzle spectrum
Would you? I'm not very knowledgeable about LLMs, but my understanding was that each query is essentially a stateless inference with the previous input/output as context. In that case, isn't a single puzzle, yielding hundreds of queries, essentially hundreds of path-dependent but individual tests?
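For what it's worth, here's a toy sketch of that stateless pattern (the `model` function is a stub standing in for a real inference call, not any actual API):

```python
# Toy illustration: each "query" is a stateless call that receives the
# full conversation so far; nothing persists between calls except what
# the caller resends. The model here is a stub, not a real LLM.
def model(messages):
    return f"reply to {len(messages)} messages"

history = []
for turn in ["guess a word", "score: 2 letters right", "guess again"]:
    history.append({"role": "user", "content": turn})
    reply = model(history)  # the entire history is resent every time
    history.append({"role": "assistant", "content": reply})

print(len(history))  # 6: three user turns, three assistant replies
```

So each of those hundreds of queries within one puzzle is an independent forward pass, but they're path-dependent: every call's input includes all the earlier outputs.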
From what I understood, it's a coding challenge: the models wrote a player for that specific word game. E.g. https://github.com/rayonnant-ai/aicc/blob/main/wordgempuzzle...
Generally speaking, would you draw a conclusion based only on an event that happened once?
Seems like in agentic workflows the Qwen Flash and DeepSeek Flash models are quite good.
Fits with another comment on here from yesterday that said the flash models are just better at tool calling.
Planning with GPT 5.5 and implementation with a flash model could be the bang-for-the-buck route.
In my experience benchmarks are pretty meaningless.
Not only is performance dependent on the language and tasks given, but also on the prompts used and the expected results.
In my own internal tests it was really hard to judge whether GPT 5.5 or Opus 4.7 is the better model.
They have different styles and it's basically up to preference. There were even times where I gave the win to one model, only to think about it more and change my mind.
At the end of the day I think I slightly prefer Opus 4.7.
I think benchmarks are improving and will always have value, but they're the equivalent of someone's college and GPA on an entry-level job application.
It's a strong signal for a job, but the soft skills are sometimes going to get Claude Opus 4.6 a job over smarter applicants. That's what we'd really like to measure objectively, and we're actively working on it.
In addition, the harness around these models does a lot of work and changes the outcome significantly.
I just had an issue where Claude CLI with Opus 4.7 High could not figure out why my Blazor Server program was inert (buttons didn't do anything, etc.). After several rounds, I opened the web console and found that it failed to load blazor.js due to a 404 on that file. I copied the error message to Claude CLI, and after several more unproductive rounds I gave up.
I then moved to Codex, with GPT 5.5 High. I gave it the code base, problem description, and error codes. Unlike Claude CLI, it spun up the project and used wget/curl to probe for blazor.js, and found that it was indeed not served. It then did a lot more probing and some web searches, and after a while found that my project file was missing a setting. It added that and then probed to verify it worked.
So Codex fixed it in about 20 minutes without me laying hands on it (other than approve some program executions).
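For anyone curious what that probing step amounts to, here's a minimal, self-contained reproduction in Python (the server is a stand-in for the misconfigured app, and the asset path is just an example; the point is confirming the 404 the browser reported):

```python
import http.server
import threading
import urllib.error
import urllib.request

# Stand-in for the misconfigured app: a local server that simply
# doesn't serve the script the page expects. Port 0 picks a free port.
server = http.server.ThreadingHTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def status(path):
    """Probe a path the way `curl -I` would and return the HTTP status."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}{path}") as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

# Example path; the missing script confirms the 404 the console showed.
print(status("/_framework/blazor.server.js"))  # 404
server.shutdown()
```

Probing the server directly like this, rather than reasoning from the source alone, is exactly the step that shortcut the debugging.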
However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.
For reference, this was me just trying to see how good the vibecoding experience is now, so I was trying to do this as hands-off as possible.
You can fix Claude's laziness by modifying the system prompt. https://gist.github.com/chyzwar/99fe217c3ed336f57c74dcffe371...
> However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.
A model that can more effectively make use of the tools presented to it is going to be better. You're not wrong about the system prompt; these can have quite a pronounced effect, especially when what the agent is bridging to is not just a case of bash + read/write; you need the prompt (and tool descriptions) to steer and reinforce what it should actually do because most models are heavily over-trained on executing bash lines.
When it comes to more basic agent usage that just runs in a terminal and executes bash ultimately most models are going to do just fine as long as you provide the very basics.
Regarding the case in your post, it could be any number of issues: the provider being overloaded, leaving less compute for your session; the model just not being particularly great; your previous context (in your original session) subtly nudging the model away from the correct thing; and so on.
The truth is that you simply can't really know what the exact cause of this behavior you experienced is, but I think you're also working hard to cope on behalf of Anthropic.
All in all I think you're placing a bit too much faith in agents and their effect. If you slim down and use something like Pi instead you'll likely get a more accurate sense of what agents do and don't do, and how it affects things. You can then also add your own things and experiment with how that impacts things as well.
I've written an agent that only allows models to send commands to Kakoune (a niche text editor that I use), and I can say that building an agent that just executes bash + read/write in 2026 is probably the easiest proposition ever. I say this because a lot of my work has gone into steering models away from constantly trying to write bash lines; models all seem to tend towards this, so if that's all you wanted anyway, most of your work is already done. The vast majority of the work in those types of agents is better spent fixing model quirks and bad provider behavior in terms of input/output.
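To illustrate how small that baseline really is, here's a sketch of the entire tool surface of such a bash + read/write agent (the model-facing tool-calling loop is omitted, and the names are mine, not from any particular agent framework):

```python
import subprocess

# The complete tool surface of a minimal bash + read/write agent.
# In a real agent, these three functions are exposed as tool calls
# and the model decides which to invoke each turn.
def run_bash(cmd: str, timeout: int = 30) -> str:
    """Execute a shell command and return combined stdout/stderr."""
    result = subprocess.run(
        ["bash", "-c", cmd], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def write_file(path: str, content: str) -> None:
    with open(path, "w") as f:
        f.write(content)

print(run_bash("echo hello").strip())  # hello
```

Everything beyond this is prompt engineering, error handling, and provider quirks, which is why steering a model *away* from bash is harder than letting it use bash.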
> A model that can more effectively make use of the tools presented to it is going to be better.
Of course. What I was getting at is that if harness A doesn't expose certain useful tools that harness B does, it doesn't matter whether the model could use them.
> I think you're also working hard to cope on behalf of Anthropic
How on earth did you get that out of my post? I was just reporting on a recent experience I had, to make the point that harness+model is a very different thing from just the model when it comes to evaluating effectiveness and quality of output.
I actually noticed this too. GPT 5.5 is much more "hands on" with calling tools to debug issues and verify results. I did all my tests in Cursor but I don't know if they use a different system prompt for each model.
Are your tests and results open source?
Test result summaries are openly available, test environments are not.
Curious: why can't you provide a measurement of context size for a human? Surely there must be enough science to make a good approximation.
Any thoughts on using it on Fireworks? It's extremely fast there.
I'm not sure how many of our requests got routed to Fireworks -- for our testing, we set preferences to route to providers with the highest advertised quantization, the highest reasoning-mode support, or preferably the model developer itself.
While it may be possible to get better numbers from certain providers, we try to establish a common baseline. E.g., if we measure that Kimi K2.6 averages 450s on a task and GLM 5.1 averages 400s, you might be able to improve Kimi's number on a provider like Fireworks, but GLM 5.1 would also likely be ~10% faster on the premium provider. That said, this is a caveat worth considering when comparing against proprietary model speeds on the site.
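That caveat is just ratio arithmetic: with the (hypothetical) numbers above, a uniform provider speedup changes absolute times but leaves the ranking signal untouched.

```python
# Hypothetical latencies from the comment above (seconds per task).
kimi, glm = 450.0, 400.0
print(round(kimi / glm, 3))  # 1.125 -- Kimi ~12.5% slower

# If a premium provider makes *both* models ~10% faster,
# the relative comparison is unchanged.
print(round((kimi * 0.9) / (glm * 0.9), 3))  # 1.125
```

The absolute numbers on the site are only directly comparable within the same routing policy; ratios travel better across providers.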