We've spent some time trying to understand this anomaly, even re-running Sonnet 4.6 through our evaluations to see if that would bring down its scores... and it didn't. I don't know what they did differently, but it's basically Opus 4.6 with more temperature variability (some great responses, some less great, with a roughly frontier-level median response, specifically in agentic work). It is smart, methodical, and excellent at tool calling in our custom environments.

We now use Sonnet 4.6 for a number of internal use cases we wouldn't have considered otherwise.

That tracks with my experience.

4.7 was so bad that I locked a bunch of my machines to 4.6.

I haven’t bothered locking the 4.8 machines to 4.6. There was an HN thread a while back where they run SWE-bench a few times a day and measure success rate and latency. It showed Opus getting significantly dumber in the week before a recent launch.

It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.

Or, maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.

Doesn't track with mine. I've been stuck with Sonnet 4.6 at one of the clients I work for. It writes code fine, but it's not nearly as good as the more recent models for everything else. It's fairly common for it to suddenly go off the rails for no good reason, so I can't really trust it with agentic loops. It's also not very good at diagnosing non-trivial issues. It's not uncommon for it to suggest whole lists of irrelevant or nonsensical reasons for something not working. Then I copy/paste the code and some context into ChatGPT and it homes in on the correct issue right away, even with inferior tooling.