I'm the author of the report in there. The stop-phrase-guard didn't get attached but here it is: https://gist.github.com/benvanik/ee00bd1b6c9154d6545c63e06a3...

You can watch for these yourself - they are strong indicators of shallow thinking. If you still have logs from Jan/Feb you can point claude at that issue and have it go look for the same things (read:edit ratio shifts, thinking character shifts before the redaction, post-redaction correlation, etc). Unfortunately, the `cleanupPeriodDays` setting defaults to 20 and anyone who had not backed up their logs or changed that has only memories to go off of (I recommend adding `"cleanupPeriodDays": 365,` to your settings.json). Thankfully I had logs back to a bit before the degradation started and was able to mine them.

The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers have redacted thinking or limit it, but some non-Anthropic ones do (most that are not API pricing). The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk same you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/

The "this test failure is preexisting so I'm going to ignore it" thing has been happening a lot for me lately, it's so annoying. Unless it makes a change and then immediately runs tests and it's obvious from the name/contents that the failing test is directly related to the change that was made it will ignore it and not try to fix.

I'm curious about your subscription/API comparison with respect to thinking. Do you have a benchmark for this, where the same set of prompts under a Claude Code subscription result in significantly different levels of effective thinking effort compared to a Claude Code+API call?

Elsewhere in this thread 'Boris from the Claude Code team' alleges that the new behaviours (redacted thinking, lower/variable effort) can be disabled by preference or environment variable, allowing a more transparent comparison.