That tracks with my experience.

4.7 was so bad that I locked a bunch of my machines to 4.6.

I haven’t bothered locking the 4.8 machines to 4.6. There was an HN thread a while back where they run SWE-bench a few times a day and measure success rate and latency. It showed Opus getting significantly dumber in the week before a recent launch.

It wouldn’t surprise me if they’re quantizing to improve margins or to hype models in comparative testing in order to defraud investors at IPO.

Or, maybe QA is hard. Anyway, I think they hit a performance wall sometime at or before 4.6.

Doesn't track with mine. I've been stuck with Sonnet 4.6 at one of the clients I work for. It writes code fine, but it's not nearly as good as the more recent models at everything else. It fairly often goes off the rails for no good reason, so I can't really trust it with agentic loops. It's also not very good at diagnosing non-trivial issues. It's not uncommon for it to suggest whole lists of irrelevant or nonsensical reasons for something not working. Then I copy/paste the code and some context into ChatGPT and it homes in on the correct issue right away, even with inferior tooling.