This looks quite fantastic!
Nice improvements in scores across the board, e.g.
> On coding, it [the new Sonnet 3.5] improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding.
I've been using Sonnet 3.5 for most of my AI-assisted coding and I'm already very happy (I use it with the Zed editor and love the "raw" UX of its AI assistant), so any improvements, especially seemingly large ones like this, are very welcome!
I'm still extremely curious about how Sonnet 3.5 itself, and this new iteration, are built and how they differ from the original Sonnet. I wonder if it's in any way based on their previous interpretability work[0], which they used to make Golden Gate Claude.
[0]: https://transformer-circuits.pub/2024/scaling-monosemanticit...
I'm waiting for the Aider benchmark.
It's out, and improved!
It went from 77.4% to 84.2%, skipping past o1-preview, which is at 79.7%.
Source: https://aider.chat/docs/leaderboards/