Yet https://marginlab.ai/trackers/claude-code/ says no issue.

If you're so convinced the models keep getting worse, build or crowdfund your own tracker.

If I'm reading that page correctly, the benchmark results don't cover the interesting "mid February" inflection point noted in the article/report; the numbers appear to begin after the quality drop had already started. Moreover, the daily confidence intervals seem stupidly wide; one spans roughly 42% to 69%?
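
Back-of-the-envelope, and assuming the tracker uses something like a plain normal-approximation binomial interval at 95% (the page doesn't say), an interval that wide implies only on the order of ~50 runs per day:

```python
# Rough sanity check: what daily sample size would produce a 95% CI this wide?
# Assumes a normal-approximation binomial interval; the tracker's actual
# methodology isn't stated, so treat this as an illustration only.
import math

low, high = 0.42, 0.69            # interval read off the chart
p = (low + high) / 2              # implied point estimate, ~0.555
half_width = (high - low) / 2     # ~0.135
z = 1.96                          # 95% confidence

# half_width = z * sqrt(p * (1 - p) / n)  =>  solve for n
n = (z ** 2) * p * (1 - p) / half_width ** 2
print(f"implied daily sample size: ~{n:.0f} runs")  # roughly 50
```

With that few runs per day, a real regression of a few percentage points would be hard to distinguish from noise anyway.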

The "Other metrics" graphs extend for a longer period, and those do seem to correlate with the report. Notably, the 'input tokens' (and consequently API cost) roughly halve (from 120M to 60M) between the beginning of February and mid-March, while the number of output tokens remains similar. That's consistent with the report's observation that new!Opus is more eager to edit code and skips reading/research steps.

Came here to post this as well. It's interesting to see how benchmarks don't always track feelings, which is one of the things people say in favor of Anthropic models!