I was surprised by Sonnet's performance, as well. It's difficult to say any model is really worse or better based on one attempt across nine bugs (several of which have proven intractable for all models so far), but on this particular set of problems, Haiku seems to have done a little better. Self-hosted Qwen 3.6 and Gemma 4 also seem to have done better than Sonnet or Haiku, which is surprising. So there are surely confounding variables here, but I don't know what they are yet. More testing and more analysis of the data will probably reveal them. It may be that running the Anthropic models in the simpler API harness will unleash their power; maybe there are guardrails baked into the Claude Code system prompt that make the small models too conflicted about right and wrong to answer clearly.
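
To put a rough number on how little one attempt across nine bugs can tell us, here is a minimal Python sketch that computes an approximate 95% Wilson score interval for a pass rate. The function name `wilson_interval` and the scores of 4/9 and 5/9 are hypothetical placeholders for illustration, not the actual benchmark results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical scores, NOT the real results: two models one bug apart on nine bugs.
model_a = wilson_interval(4, 9)
model_b = wilson_interval(5, 9)

print(f"model A (4/9): {model_a[0]:.2f} to {model_a[1]:.2f}")
print(f"model B (5/9): {model_b[0]:.2f} to {model_b[1]:.2f}")
```

With only nine trials, each interval spans roughly half of the 0 to 1 range, so two models that differ by a bug or two overlap almost entirely. That is why a single run like this can't really rank them.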

DeepSeek was actually the `deepseek-chat` alias in the API (which dynamically chooses the model based on criteria I don't know), but when I checked the usage, it was all DeepSeek V4 Pro for the benchmark. I've since changed DeepSeek to explicitly use Pro, so subsequent experiments won't depend on the alias.

I'll probably do a test of exclusively smaller models at some point. But DeepSeek V4 Pro is so cheap, especially given how effective their caching is and their cached-input pricing, that for my own use I'll probably just use it whenever I need a cheap, fast, near-frontier model.