The models don’t change.
On paper. There's a huge financial incentive to quantize the crap out of a good model to save cash after you've hooked people into subscriptions.
And there’s an incentive to publish evidence of this to discourage it. Do you have any?
Models aren't just the big bags of floats you imagine them to be. Those bags are there, but there's a whole layer of runtimes, caches, timers, load balancers, classifiers/sanitizers, etc. around them, all of which have tunable parameters that affect the user-perceptible output.
There really always is a man behind the curtain eh?
Often it's literally just that:
https://www.msn.com/en-us/money/other/ai-startup-backed-by-m...
It's still engineering. Even magic alien tech from outer space would end up with an interface layer to manage it :).
ETA: reminds me of biology, too. In life, it turns out the simpler some functional component looks, the more stupidly overcomplicated it is when you look at it under a microscope.
There's this[1]. Model providers have a strong incentive to switch (a part of) their inference fleet to quantized models during peak loads. From a systems perspective, it's just another lever. Better to have slightly nerfed models than complete downtime.
[1]: https://marginlab.ai/trackers/claude-code/
So - as the charts say - no statistical difference?
Isn't this link an argument against the point you are making?
The chart doesn't cover the 4.6 release, which was in the late-December/early-January time frame. So it's hard to tell from existing data.
Anybody with more than five years in the tech industry has seen this done in all domains time and again. What evidence do you have that AI is different, which is the extraordinary claim in this case...
Or just change the reasoning levels.
Real world usage suggests otherwise. It's been a known trend for a while. Anthropic even confirmed as much ~6 months ago but said it was a "bug" - one that somehow just keeps happening 4-6 months after a model is released.
Real world usage is unlikely to give you the large sample sizes needed to reliably detect the differences between models. Standard error scales as the inverse square root of sample size, so even a difference as large as 10 percentage points would require hundreds of samples.
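To put rough numbers on that, here's a back-of-the-envelope sketch using the standard two-proportion z-test sample-size formula. The pass rate (50%), significance level (5%), and power (80%) are my assumptions for illustration, not figures from any tracker:

```python
# Sample size needed per model to detect a 10-point drop in pass
# rate (0.50 -> 0.40) with a two-sided two-proportion z-test.
z_alpha = 1.96   # critical value for two-sided 5% significance
z_beta = 0.84    # critical value for 80% power
p, delta = 0.5, 0.10

# Standard approximation: n per group = 2 * (z_a + z_b)^2 * p(1-p) / delta^2
n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
print(round(n))  # -> 392 samples per model
```

And that's for a 10-point gap; halving the detectable difference quadruples the required sample size, since it enters the formula squared.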
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
It's hard to trust public, high-profile benchmarks: any candidate change to a specific model (Opus 4.5 in this case) can be rejected internally if it regresses on SWE-Bench-Pro, so everything that actually gets released will perform well on that benchmark.
Any other benchmark at that sample size would have similarly huge error bars. Unless Anthropic makes a model that works 100% of the time or writes a bug that brings it all the way to zero, it's going to work sometimes and fail sometimes, and anyone who thinks they can spot small changes in how often it works without running an astonishingly large number of tests is fooling themselves with measurement noise.
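For a sense of scale, here's the 95% interval on a 50-task daily sample, using a normal approximation and a pass rate near 50% (both assumptions on my part):

```python
from math import sqrt

n, p = 50, 0.5  # ~50 tasks per day, pass rate near 50%
# 95% confidence interval half-width under the normal approximation
half_width = 1.96 * sqrt(p * (1 - p) / n)
print(f"±{half_width:.0%}")  # -> ±14 percentage points per daily sample
```

With error bars that wide, day-to-day swings of several points are pure sampling noise.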
They do. I'm currently seeing a degradation on Opus 4.6 on tasks it could do without trouble a few months back. Obviously I'm a sample of n=1, but I'm also convinced a new model is around the corner and they preemptively nerf their current model so people notice the "improvement".
Make that 2, I told my friends yesterday "Opus got dumb, new model must be coming".
I swear that different sessions route to different quants. Sometimes it's good, sometimes not.
You sure about that?
https://marginlab.ai/trackers/claude-code/
Well, I don't see 4.5 on there ... so I'm not sure what you're trying to say.
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
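The arithmetic behind that, taking the 53% vs. 56% figures and the 50-200 prompts per 5-hour window at face value:

```python
drop = 0.56 - 0.53  # 3-point drop in pass rate
for prompts in (50, 200):
    # expected extra failed prompts over one 5-hour window
    print(f"{prompts} prompts -> ~{drop * prompts:.1f} extra failures")
```

So roughly 1.5 to 6 extra failed prompts per window - and, as noted, several failed prompts still have to compound before you end up with an actual wrong solution.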
Only nominally...
Oh yes, they do.
I think the conspiracy theories are silly, but equally I think pretending these black boxes are completely stable once they're released is incorrect as well.
No conspiracy theories. Companies being scumbags, cutting corners, and doctoring benchmarks while denying it. Happens since forever.