On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design, which pleasantly surprised us, especially given that GPT-5.5 was surpassing it on various coding benchmarks.
What I noticed is that we don't have a common benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
Agreed, my vibes tell me 4.6 is a better coder than 4.7. 4.7 is a much better strategic thinker and maintains overall "better architecture" than 5.5. 5.5 is way better than either at coding, but more expensive. So I have 4.7 do the planning/architecture, 4.6 does the coding, then 5.5 critiques and fixes it.
This is my exact vibesperience
Agreed, these are my vibes too. It feels much better to do planning, strategy, architecture, etc. with Opus 4.7 than with GPT-5.5. GPT just feels like a robot that gets instructions and does exactly that. Opus feels almost like a human that sometimes has genuinely good ideas and pushes back on bad ones.
So for now it's planning/architecture/strategy -> Opus. Pure coding -> GPT.
It also helps with agentic coding that GPT is much roomier with the tokens you get.