Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.
Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth-shattering, but it was a step in the right direction. And distinctly so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...
It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an underrated improvement for those who use Claude Code all the time.
This is exactly what I find frustrating. I get comfortable with the latest model X. Then a new sparkly model Y launches. I am like, I don't need your newfangled Y that consumes more tokens. My needs are small and I am happy with the older X.
But then X starts to degrade. At first subtly, and then drastically. So then I am forced to upgrade to Y.
What I do not understand is:
> is this a sneaky way for companies to push users up the chain?
> Or is this a genuine fault in model design/resource allocation?
I suppose it is both. Basically all frontier models are inference-time compute bound thanks to reasoning. And actual reasoning traces are locked behind closed doors at all American labs. So whenever they want to push a new model and need to give it hardware, it would make sense to cut into the reasoning budgets of older models. Users will not be able to see that directly; it will only become apparent on high-end, difficult tasks, which are exactly the kind of tasks where the provider wants you to use the new model anyway, so they can further improve it.
The economics of AI fall apart if you stay with the old model forever. No need to buy new GPUs or build new data centers.
So the latest in planned obsolescence is LLMs.
Can you think of many examples of a SaaS provider who regularly keeps old versions of a product around for customers to use?
A far more common scenario is that new versions are rolled out to everyone, without offering a choice, as soon as they're considered stable.
Older versions consume resources and require staff to spend time on operating and supporting them. Those resources could be used to run a newer version.
The tl;dr is the simple economics of any SaaS product.
If you want to be able to run old versions indefinitely and control the resources assigned to it, you need to self-host (an open model).
> Can you think of many examples of a SaaS provider who regularly keeps old versions of a product around for customers to use?
Sure. Blender and Ubuntu offer long-lived old versions of their software that get regular fixes.
Neither Blender nor Ubuntu is SaaS. You're just confirming my point: if you want to run old versions of software, you need to host it yourself.
February was some kind of nirvana. I do think Claude Code versions, and what gets introduced at that level, are/were relevant.
But 4.8 xhigh w/ ultracode is, to me, just about Fable level (w/ some agent harness tweaking).
But I have to switch to 4.7 xhigh and 4.6 max quite often these days.
I miss the old Opus 4.6 too. They're probably quantizing the old models.
K/V cache compression and context shortening / summarisation. And yes, I suspected quants too.
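For anyone unsure what "quantizing" would even mean here, a toy sketch: symmetric int8 rounding of a few made-up weights, in plain Python. This is only an illustration of the general technique. The function names and the weight values are hypothetical, and nothing here says what any provider actually does.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto integers in [-127, 127]."""
    # One scale for the whole list; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, purely for illustration.
weights = [0.02, -1.27, 0.5, 0.731]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each weight moves by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

The per-weight error is tiny, which is why the effect, if any, would be hard to spot on easy prompts and only show up in aggregate.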
All of these discussions of models being "nerfed" remind me of discussions among audiophiles: "this cable sounds so much better than this other one, it's night and day, Ferrari versus Honda Civic."
Yet when you do blind tests they can't tell the difference between a $1000 cable and a $1 one.
I bet if you do blind tests between GPT-5.3, 5.4 and 5.5 most would struggle to tell them apart, yet they are certain that "5.5 was nerfed 1 week after release, it's so obvious, it was John Carmack, now it can barely write a for loop."
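If someone wanted to actually score such a blind test, a minimal sketch would be an exact one-sided binomial test: how likely is it to get at least this many identifications right by pure guessing? The function name and the 20-trial numbers below are hypothetical, and it assumes each trial is an independent two-way guess.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting `correct` or more right out of `trials` by guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical outcomes of a 20-round blind A/B identification test.
print(f"15/20 correct: p = {p_value_at_least(15, 20):.4f}")  # about 0.0207
print(f"12/20 correct: p = {p_value_at_least(12, 20):.4f}")  # about 0.2517
```

So 15 out of 20 would be hard to explain as guessing, while 12 out of 20 would be roughly what a coin flip gives you a quarter of the time.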
Actually, blinded ELO rankings of models do vary: https://the-frontier.app. That said, your point looks accurate as far as 5.3 through 5.5 on this chart: a 40 to 50 point ELO gain.
I find I have to argue with 5.5 less than 5.3, and I therefore use it when I could reach for 5.3, but I don't think it's a major difference.
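For a sense of scale, here is a quick sketch of what a 40 to 50 point gap means under the standard Elo logistic formula (400-point scale). It assumes the chart uses that conventional formula, which I haven't verified.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score of the higher-rated side for a given rating gap (standard Elo)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


for gap in (40, 50):
    print(f"{gap}-point gap -> {elo_expected_score(gap):.1%} expected win rate")
# 40-point gap -> 55.7% expected win rate
# 50-point gap -> 57.1% expected win rate
```

In other words, the newer model would be preferred in roughly 56 or 57 of 100 blind head-to-head comparisons, which is measurable but not night and day.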
Electric Light Orchestra really stole Arpad Elo's thunder.
Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.
And companies are highly aware of all this.
They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.
The market will favor companies that do it.
And they are in the best position to automate online narrative shift (the real LLM killer application IMO) towards "Users are imagining it".
That's a pretty shallow dismissal, and I bet you $100 I can tell you which model I'm talking to between 4.6 and 4.8 without looking or asking after a handful of messages.
Anthropic famously had a terrible outage back when 4.6 was the latest and greatest, and it was never the same after it came back.
All evidence suggests they simply don't have the compute to keep serving their best models at their most powerful.
You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock, I caught it in my tests and wrote about it, and they actually rolled it back. This is how much non-coding tests may tell you about the model.
So Bedrock 4.6 is old school Opus?
I know you can point Claude Code at Bedrock. Might be worth a play.