My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple of weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and told Claude, pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for one part that went in a different direction than I intended, because my essay was ambiguous about how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.

And once you get unlucky you can’t unsee it.

So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.

I have to agree with OP: in my experience it is usually more productive to start over than to try correcting output early on. Deeper into a project, it gets a bit harder to pull off a restart. I sometimes fork my chats before attempting a correction so that I can resume the original just in case. (Yes, I know you can double-tap Esc, but the restoration has failed for me a few times in the past and now I generally avoid it.)

I also think some of this stems from the default 1M context window. Performance starts to degrade as context size increases, and each token beyond (I think the threshold is) 400k counts more toward your usage limit. With 1M as the default, if people aren't carefully managing context (which they shouldn't ever have to in an ideal world), they'd notice somewhat degraded performance and increased token usage regardless.

I can't remember what the technique is called, but back in the GPT-4 days there was a paper about making a number of attempts at responding to a prompt and then having a final pass where the model picks the best one. I believe this is part of how the "Pro" GPT variant works, and Cursor also supports this in a way (though I'm not sure if the automatic pick-the-best pass at the end is part of it; never tried it).
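The approach described above (sample N candidates, then select one) is a minimal sketch below, with the LLM call and the judging pass both stubbed out as placeholders; in a real setup `generate` would be a nondeterministic model call and `score` would typically be another model acting as a judge:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Placeholder for a nondeterministic LLM call.
    Simulates varying-quality outputs with a seeded RNG."""
    rng = random.Random(seed)
    quality = rng.random()
    return f"answer (quality={quality:.2f})"

def score(candidate: str) -> float:
    """Placeholder judge pass. A real setup would ask a model to
    rank candidates; here we just parse the simulated quality."""
    return float(candidate.split("quality=")[1].rstrip(")"))

def best_of_n(prompt: str, n: int = 5) -> str:
    """Sample n candidates, keep the one the judge scores highest."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("write a productivity app plan", n=5))
```

Manually re-running a chat four times, as OP did, is effectively doing this loop by hand with yourself as the judge.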

I have found Claude to be especially unpredictable. I've mostly switched to GPT-5.4 now - although it's slightly less capable, it's massively more reliable.

[deleted]

I think they are routing to cheaper models that present themselves as e.g. Opus. I now add things to my prompts to check that I'm not dealing with an impostor. If it answers incorrectly, I terminate the session and start again. Anthropic should be audited for this.

You probably could have written the low-stakes productivity app in a fraction of the time you wasted on this.

Or learnt to use an existing one.

I vibed a low-stakes budgeting app before realising that what I actually needed was Actual Budget, plus a small change in how I budget my money.

> My hypothesis is that some of this [is] a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of [LLM] output.

I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.

Er, no, I am fully aware that LLMs have always been non-deterministic.

Your argument seems to be that a statistically improbable number of people all happened to experience randomly poor outputs, leading to a mere misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.

Not everyone is reporting, and the number of users is not constant. On the former, the noisiest will always be those who experience an issue; on the latter, there are more people than ever using Claude Code regularly.

Combine these things under the strongest interpretation instead of an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.

I'm not convinced this is the story (or at least the biggest part of it) myself, but I'm not ready to declare it illogical either.

No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.

Not really; they said "some of this is a perceived quality drop", and that's almost certainly correct: _some_ of it is that.

When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.

[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.

I wonder how well the "good" versions worked if you threw awkward edge cases at it.