(Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)
I was wondering if anyone else is experiencing this too. I have personally found that I have to add more and more CLAUDE.md guide rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking online for other people corroborating my observations.
This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI-assisted) it's hard to know for sure whether it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, though, its findings match my own experience: I've seen noticeable degradation in Opus's outputs and thinking.
EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.
What I've noticed is that whenever Claude says something like "the simplest fix is..." it's usually suggesting some horrible hack. And whenever I see that I go straight to the code it wants to write and challenge it.
That is the kind of thing I've been fighting by being super explicit in CLAUDE.md. For whatever reason, instead of being thorough and making sure files are changed only after fully understanding the scope of the change (its behaviour prior to Feb/Mar), Claude now just jumps to the easiest fix, with no thought for backwards compatibility and to hell with all existing tests. What is even worse, on a couple of occasions I've seen it try to edit files before even reading them, which is a big red flag. (/effort max)
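For what it's worth, the guardrails I've been adding to CLAUDE.md look roughly like this (my own wording, purely illustrative - CLAUDE.md has no prescribed schema, it's just free-form instructions the agent reads):

```markdown
## Editing rules

- ALWAYS read a file in full before editing it.
- Before changing any public function, list all of its call sites first.
- Do NOT propose "the simplest fix" without stating what it breaks
  (backwards compatibility, existing tests).
- Run the existing test suite before and after the change; never delete
  or skip a failing test just to make the build green.
```

It shouldn't be necessary to spell out things this basic, but lately it seems to be.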
Another thing that worked like magic prior to Feb/Mar was how reliably Claude would load a skill whenever it deduced that one might be useful. I personally use [superpowers][1] a lot, and I've noticed that I now have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.
[1]: https://github.com/obra/superpowers
I did not use the previous version of Opus, so I can't speak to that difference, but Sonnet 4.6 seems optimized to output the shortest possible answer. It usually starts with a hack, and if you challenge it, it will apologize and point you back to a previous answer with the smallest code snippet it can provide. Agentic use isn't necessarily worse, but ideating and exploring are awful compared to 4.5.
I did my usual thing today where I asked a Sonnet 4.6 agent to code-review a design plan drafted by Opus 4.6 - something I've been doing lately before delving into implementation. What it came back with was a verbose output suggesting that a particular function, `newMoneyField`, be renamed throughout the doc to a name it fabricated, `newNumeyField`. And the thing is that the design document referenced the correct function name more than a few dozen times.
This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead came back with a verbose suggestion to do a mechanical search-and-replace to this newly fabricated function name - one that it even spelled incorrectly. I had to Google "numey" to make sure Sonnet wasn't outsmarting me.
Superpowers, Serena, Context7 feel like required plugins to me. Serena in particular feels like a secret weapon sometimes. But superpowers (with the "brainstorm" keyword) might be the thing that helps people complaining about quality issues.
lol this one time Claude showed me two options for implementing a new feature on an existing project: one JavaScript client-side and the other Python server-side.
I told it to implement the server-side one, it said ok, I tabbed away for a while, and came back to find the JS implementation. Checking the log, Claude had said “on second thought I think I’ll do the client side version instead”.
Rarely do I throw an expletive bomb at Claude - this was one such time.
Using superpowers in brainstorm mode like the parent suggested would have resulted in a plan markdown and a spec markdown for the subagents to follow.
Dunno man, Claude had a spec (pretty sure I asked it to consider and outline both options first), or at least clear guidance, and decided to YOLO whatever it wanted instead.
It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.
this prompt is actually in the Claude CLI. It says something like "implement the simplest solution, don't over-abstract". I'm on my phone, but I saw an article mention this in the leak analysis.
If that tracker is using paid API tokens, as opposed to a regular subscription, then there's no financial incentive for Anthropic to degrade the model's thinking, so their benchmark likely would not be affected by any cost-cutting measures that regular users face.
Also, it's probably very easy to spot such benchmarks and lock in full thinking just for them. Some ISPs do the same: your internet speed magically resets to normal as soon as you open speedtest.net ...
I haven't noticed any changes but my stuff isn't that complex. People are saying they quantized Opus because they're training the next model. No idea if that's true... It's certainly impacting my decision to upgrade to Max though. I don't want to pay for Opus and get an inferior version.
I haven't noticed any changes either, but I noticed that opus 4.6 is now offered as part of perplexity enterprise pro instead of max, so I'm guessing another model is on the horizon
I just finished reading the full analysis on GitHub.
> When thinking is deep, the model resolves contradictions internally before producing output.
> When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."
Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.
I missed that from the long issue, thanks for pointing it out! My experience with Opus today was riddled with these to the point where it was driving me completely mental. I've rarely seen those self-contradictions before, and nothing on my setup has changed - other than me forcing Opus at --effort max at startup.
I wonder if this is even more exaggerated now over Easter, as everyone's got a bit of extra time to sit down and play with Claude. That might be pushing capacity over the limit - I just don't know enough about how Anthropic provisions and manages capacity to know if that could be a factor. Quality has gotten really bad over the holiday, though.
Cannot say I've noticed, but I run virtually everything through plan mode and a few back and forth rounds of that for anything moderately complex, so that could be helping.
I used to one-shot design plans early in the year, but lately it takes several iterations just to get the design plan right. Claude frequently forgets to update back-references and doesn't keep the plan in sync with the evolving conversation. It has gotten so bad that I've had to run several review loops on the design spec before I can move on to implementation. At one point I thought it was the superpowers plugin itself that had auto-updated and self-nerfed, but there weren't any updates on my end anyway. Shrug.