My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple of weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and told Claude, pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for one part that went in a different direction than I intended, because my essay was ambiguous about how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

This is my theory too. There’s a predictable cycle where the models “get worse.” They probably don’t. A lot of people just take a while to really hit hard against the limitations.

And once you get unlucky you can’t unsee it.

So will we have to do what image generation people have been doing for ages: generate 50 versions of output for the prompt, then pick the best manually? Anthropic must be licking its figurative chops hearing this.

I have to agree with OP: in my experience it is usually more productive to start over than to try correcting output early on. Deeper into a project, it gets a bit harder to pull off a restart. I sometimes fork my chats before attempting a correction so that I can resume the original just in case. (Yes, I know you can double-tap Esc, but the restoration has failed for me a few times in the past and now I generally avoid it.)

I also think some of this stems from the default 1M context window. Performance starts to degrade as context size increases, and each token beyond (I think the threshold is) 400k counts more toward your usage limit. With 1M as the default, if people aren't carefully managing context (which they shouldn't ever have to in an ideal world), they'd notice somewhat degraded performance and increased token usage regardless.

I can't remember what the technique is called, but back in the GPT-4 days there was a paper about making a number of attempts at responding to a prompt and then having a final pass where the model picks the best one. I believe this is part of how the "Pro" GPT variant works, and Cursor also supports this in a way (though I'm not sure if the automatic pick-the-best pass at the end is part of it; never tried it).
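The approach described above (sample N candidates, then select one) is a minimal sketch below, with the LLM call and the judging pass both stubbed out as placeholders; in a real setup `generate` would be a nondeterministic model call and `score` would typically be another model acting as a judge:

```python
import random

def generate(prompt: str, seed: int) -> str:
    """Placeholder for a nondeterministic LLM call.
    Simulates varying-quality outputs with a seeded RNG."""
    rng = random.Random(seed)
    quality = rng.random()
    return f"answer (quality={quality:.2f})"

def score(candidate: str) -> float:
    """Placeholder judge pass. A real setup would ask a model to
    rank candidates; here we just parse the simulated quality."""
    return float(candidate.split("quality=")[1].rstrip(")"))

def best_of_n(prompt: str, n: int = 5) -> str:
    """Sample n candidates, keep the one the judge scores highest."""
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

print(best_of_n("write a productivity app plan", n=5))
```

Manually re-running a chat four times, as OP did, is effectively doing this loop by hand with yourself as the judge.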

I have found Claude to be especially unpredictable. I've mostly switched to GPT-5.4 now - although it's slightly less capable, it's massively more reliable.

[deleted]

I think they are routing to cheaper models that present themselves as e.g. Opus. I now add things to my prompts to check that I'm not dealing with an impostor. If it answers incorrectly, I terminate the session and start again. Anthropic should be audited for this.

You probably could have written the low-stakes productivity app in a fraction of the time you wasted on this.

Or learnt to use an existing one.

I vibed a low-stakes budgeting app before realising that what I actually needed was Actual Budget, plus a small change in how I budget my money.

> My hypothesis is that some of this [is] a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of [LLM] output.

I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.

Er, no, I am fully aware that LLMs have always been non-deterministic.

Your argument seems to be that a statistically improbable number of people all happened to experience randomly poor outputs, leading to a mere misperception of model degradation… but this is not supported by reality, in which a different cause was found, so I was trying to connect your dots.

Not everyone is reporting, and the number of users is not constant. On the former, the noisiest will always be those who experience an issue; on the latter, there are more people than ever using Claude Code regularly.

Combine these things under the strongest interpretation instead of an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.

I'm not convinced this is the story (or at least the biggest part of it) myself, but I'm not ready to declare it illogical either.

No, that is not my argument, in fact I don't have any argument whatsoever. It was just a plausible observation that I felt like sharing. There's nothing further to read into it, I don't have a horse in this race.

Not really; they said "some of this is a perceived quality drop", and that's almost certainly correct: _some_ of it is that.

When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.

[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.

I wonder how well the "good" versions worked if you threw awkward edge cases at it.