Anthropic has again changed the set of benchmarks they use [0]. This time they have also moved all benchmark scores to the PDF. At a glance it looks like it gains about 5-10% over other models. The speed is about the same as Opus >=4.5 and Sonnet 4.5, and double the speed of Opus <=4.1.
Benchmark                     Mythos 5  Fable 5  MythosPrev  Opus 4.8  GPT-5.5  Gemini 3.1 Pro
SWE-bench Pro                 80.3      80       77.8        69.2      58.6     54.2
SWE-bench Verified            95.5      95       93.9        88.6      -        80.6
Terminal-Bench                88.0      84.3     -           82.7      83.4     -
BrowseComp (Single-Agent)     88.0      -        87.9        84.3      84.4     85.9
BrowseComp (Multi-Agent)      93.3      -        -           88.5      -        -
HLE (No tools)                59.0      -        56.8        49.8      41.4     44.4
HLE (Tools)                   64.5      -        64.7        57.9      52.2     51.4
CharXiv Reasoning (No tools)  88.9      -        86.2        80.5      -        -
CharXiv Reasoning (Tools)     93.5      -        92.5        89.9      -        -
BioMystery Bench (Human)      83.9      -        82.6        80.4      -        -
BioMystery Bench (Hard)       46.1      -        29.6        40.0      -        -
OSWorld-Verified              85.0      85.0     85.4        83.4      78.7     76.2*
CritPt                        28.6      -        20.9        27.1      17.7     -
ArxivMath                     78.5      68.7     71.8        71.5      64.0     -
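
To sanity-check the "5-10%" eyeball estimate, here is a rough sketch that computes the gap per benchmark from the table. It compares Mythos 5 against Opus 4.8 only, which is my own choice of baseline (it is the one other column with a score in every row); the function names are made up for illustration.

```python
# Scores copied from the table: benchmark -> (Mythos 5, Opus 4.8).
# Assumption: Opus 4.8 is the baseline only because it is the one other
# column that reports a score for every benchmark.
SCORES = {
    "SWE-bench Pro": (80.3, 69.2),
    "SWE-bench Verified": (95.5, 88.6),
    "Terminal-Bench": (88.0, 82.7),
    "BrowseComp (Single-Agent)": (88.0, 84.3),
    "BrowseComp (Multi-Agent)": (93.3, 88.5),
    "HLE (No tools)": (59.0, 49.8),
    "HLE (Tools)": (64.5, 57.9),
    "CharXiv Reasoning (No tools)": (88.9, 80.5),
    "CharXiv Reasoning (Tools)": (93.5, 89.9),
    "BioMystery Bench (Human)": (83.9, 80.4),
    "BioMystery Bench (Hard)": (46.1, 40.0),
    "OSWorld-Verified": (85.0, 83.4),
    "CritPt": (28.6, 27.1),
    "ArxivMath": (78.5, 71.5),
}


def point_gains(scores):
    """Return benchmark -> absolute gain in points (new minus baseline)."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


def relative_gains(scores):
    """Return benchmark -> gain as a percentage of the baseline score."""
    return {name: 100.0 * (new - old) / old for name, (new, old) in scores.items()}


gains = point_gains(SCORES)
rel = relative_gains(SCORES)
mean_gain = sum(gains.values()) / len(gains)

for name in SCORES:
    print(f"{name:30s} {gains[name]:+5.1f} pts  {rel[name]:+6.1f}%")
print(f"mean gain: {mean_gain:.1f} points over {len(gains)} benchmarks")
```

Against that baseline the mean gap comes out to about 5.7 points, but the relative gain varies a lot: under 2% on OSWorld-Verified and over 18% on HLE (No tools). Which figure you quote depends on whether you mean points or percent of the old score.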
[0] https://news.ycombinator.com/item?id=48312633

Edit: Also in the system card... "we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).
...
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."
It's announced as a revolution, but when you look at those benchmarks it surely looks like an iteration.