Hacker News

kgeist 19 hours ago

Every new proprietary model is "groundbreaking" and "look, it just solved task X that no other model could solve," only to be referred to as "that crappy previous-generation model" a month later.

So yeah, I'm totally fine using Kimi-2.7, GLM-5.2 or Deepseek-v4. I think we've already hit the ceiling and most improvements now seem to be from harness improvements and slightly better RL to improve reasoning/tool calling.

jbverschoor 18 hours ago

Not only that, but to me it seems that after a week the intelligence is being downscaled or routed. Maybe because of a lack of capacity.

conception 7 hours ago

You can check https://marginlab.ai/trackers/codex/

It’s pretty good at catching when performance is degraded. It was degraded for a week or so before Fable launched, for instance, probably due to A/B testing or capacity, as you noted.
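A minimal sketch of how a degradation check like this could work in principle, assuming you rerun a fixed benchmark and compare pass rates. This is not a description of how the linked tracker actually works; the function names and the pass counts are made up for illustration. It uses a one-sided two-proportion z-test.

```python
import math

def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """z-statistic for 'run A passes more often than run B' (pooled standard error)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both runs all-pass or all-fail: no evidence either way
    return (p_a - p_b) / se

def one_sided_p(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical numbers: the same 6-point drop in pass rate at two sample sizes.
z_large = two_proportion_z(580, 1000, 520, 1000)  # baseline vs. this week, 1000 tasks each
z_small = two_proportion_z(58, 100, 52, 100)      # same rates, only 100 tasks each

print(f"1000 tasks: z={z_large:.2f}, p={one_sided_p(z_large):.4f}")
print(f" 100 tasks: z={z_small:.2f}, p={one_sided_p(z_small):.4f}")
```

With these invented numbers the drop stands out at 1000 tasks per run but is indistinguishable from noise at 100, which is why a handful of bad responses proves little on its own.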

matheusmoreira 17 hours ago

There's at least the possibility that they intentionally degrade the models as time passes. We can't really verify that we're getting what we're paying for all of the time. All the more reason to invest in local inference.

inigyou 16 hours ago

What if the new model is exactly as good as the last model on launch day but better than the last model was on the new model's launch day because it was degraded? Every single time?

foo42 13 hours ago

Makes me think of [Shepard tones](Shepard tone - Wikipedia https://share.google/xooRbF7wIIhcsTt2J), which sound like they're rising in pitch indefinitely.

inigyou 2 hours ago

why are you linking to Wikipedia in invalid markdown format, which wouldn't work on HN even if it was valid, to a site called share dot google?

no-name-here 13 hours ago

There are lots of benchmarks to compare the absolute values of different models on the same scale (as opposed to vibes (my apologies for the shorthand), etc.).

matheusmoreira 14 hours ago

The thought has definitely crossed my mind. I don't think it's true because there's definitely an improvement when new models are released.

Maybe the truth is the newest models aren't actually as impressive as we thought. Maybe our perception of progress is being manipulated via months of gradual, silent and unverifiable degradation.

LPisGood 15 hours ago

People talk about this a lot. What I have never seen is a discussion of methods they might employ to degrade the models.

Let’s say I’m a bad faith LLM operator, and I want to degrade my model so the next release looks better and people want to switch to the more expensive one. How would I do that?

nessex 15 hours ago

They would quantize the model. That'd make it cheaper to run, with slightly worse output, but it would still generate outputs with a similar feel, derived from a compressed version of the same knowledge base, etc.

They wouldn't even need to do this uniformly: only a subset of the requests could be routed to quantized versions of the model. They could do this to nerf the old model, or more likely just to give themselves more hardware to run the new one on by handling more requests on less hardware. Or to handle increased request volume as traffic ramps up faster than hardware can be provisioned.

Playing with local models at various quants, the degradation can be hard to spot. Sometimes it's only noticeable in aggregate. And even then, you never really know if you just got unlucky with a bad response due to RNG.

I've had Opus 4.6 fall into some weirdly incoherent loops that I rarely see from even Sonnet, that felt like the kind of thing I got frequently with Qwen3.5 9B on local. And the above applies... Was that just bad RNG? Or was my request to Opus routed to some lower quality variant? There's no great way for me to tell for any given request, nor any way to guarantee Anthropic _didn't_ do that.
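For anyone who hasn't played with quants: a toy sketch of what quantizing weights does, assuming plain symmetric per-tensor rounding. Real schemes (per-block scales, k-quants and so on) are more sophisticated, and nothing here reflects what any provider actually runs; the weight values are random stand-ins.

```python
import random

def quantize_roundtrip(weights, bits):
    """Snap each weight to a signed integer grid with 2**(bits-1)-1 positive levels, then map back."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [round(w / scale) * scale for w in weights]

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Made-up stand-in for one weight tensor; real tensors are far larger.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

errors = {
    bits: mean_abs_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit: mean abs error {err:.6f}")
```

The per-weight error grows as the bit width shrinks, but each individual weight is still close to the original, which fits the observation that output keeps a similar feel and the damage mostly shows up in aggregate.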

OccamsMirror 13 hours ago

I have had the same experiences you've had with 4.6 and it was ever since they brought out 4.7. It's fairly obvious they're doing something like you've said here.

nessex 13 hours ago

Forgot to mention, but it was after the 4.7 release when I was still using 4.6 that I saw those loops too... Before that, 4.6 had been a pretty seamless experience.

tsss 9 hours ago

And guess what all the providers of open models do: They quantize, badly.

csunbird 8 hours ago

This is why you pay a premium for trusted providers, who are verified to not quantize.

maybe_pablo 15 hours ago

Weight quantization, n-expert capping, routing to smaller model, context window truncation, aggressive sampling constraints, lossy speculative decoding and probably more.

trollbridge 9 hours ago

I can't prove any of it, but it sure feels like that happens sometimes on Anthropic's platform.

I don't seem to get any of this with GPT-5.5 or GPT-5.5-Pro (not that I use 5.5-Pro enough to know for sure, but when I do use it, it never seems nerfed).

alfiedotwtf 12 hours ago

I'm pretty sure you could do n-expert capping on any MoE model with only a handful of lines of changes to ik_llama.cpp, but yeah... my bet is they have various quantisations and run the lower ones at peak (along with different system prompts, e.g. "we're GPU-bound right now. Get to the point with less chatter").
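A toy sketch of what n-expert capping means, assuming a standard top-k router whose selected weights are renormalized with a softmax. This is not ik_llama.cpp code; the experts, logits, and function names are invented for illustration, and real MoE layers differ in how they normalize.

```python
import math

def route(router_logits, top_k):
    """Pick the top_k experts by router logit and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    m = max(router_logits[i] for i in chosen)   # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_output(x, experts, router_logits, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts are never evaluated."""
    weights = route(router_logits, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy "experts" that just scale their input; a real expert is a feed-forward block.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, 1.5, -1.0]                  # hypothetical router scores for one token

full = moe_output(1.0, experts, logits, top_k=2)    # the model's assumed trained setting
capped = moe_output(1.0, experts, logits, top_k=1)  # capped: half the expert compute

print(f"top-2 output: {full:.4f}")
print(f"top-1 output: {capped:.4f}")
```

Capping top_k skips expert evaluations, so it saves compute per token, while the output drifts from what the model was trained to produce. The change really is confined to one parameter in the router.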

Tepix 15 hours ago

Use quantisation.

manyatoms 16 hours ago

Unless what you're getting is really explicitly spelled out in a contract, you should flatly assume that they're doing whatever they like whenever they like.

OtomotO 15 hours ago

Even if it's in the contract, but can't be verified.

taytus 17 hours ago

At current prices, and considering these open-source models' performance, investing in local inference sounds like a bad idea.

matheusmoreira 17 hours ago

Current prices are insane but at this point I'm starting to feel like it's an existential issue. I'm not a US citizen. At any point the USA could come up with some arbitrary export controls. Not having a computer capable of running at least Qwen is starting to actually seem risky to me.

At least it's going to be usable as a very high end gaming PC.

awakeasleep 16 hours ago

Why would you buy and build everything before the low probability catastrophe strikes, though? You don’t get any benefit from switching early and you pay a big opportunity cost.

Lapel2742 15 hours ago

> low probability catastrophe

There is also a low probability that someone enters peace negotiations solely to threaten the negotiators with death, yet here we are. With these guys it is: Better safe than sorry.

inigyou 16 hours ago

Because as soon as it strikes, computer hardware will be completely unavailable to buy?

CamperBob2 15 hours ago

Also, there's a nontrivial learning curve involved in running your own inference server, once you move past the casual-goofing-around-with-llama-server stage. If you care about not being a sharecropper on Sam's or Dario's plantation, you should consider learning the ropes. Even if you don't put these skills to immediate use in your day job.

I didn't appreciate this until I started down that road myself.

matheusmoreira 14 hours ago

> If you care about not being a sharecropper on Sam's or Dario's plantation

Couldn't have put it better myself. That's what all this comes down to. Owning the hardware, owning the inference. Not perpetually renting them out on a meter like in the dystopian future they're envisioning.

inigyou 9 hours ago

You also have the option to not use AI

matheusmoreira 5 hours ago

Yeah but the truth is I don't want to go back to the pre-LLM world. I've been programming alone for over ten years. Having a coding buddy to talk to, collaborate with or just bounce ideas off of quite literally changed my life. I don't want to go back to solo programming, and my projects aren't exactly swimming in a sea of active contributors.

CamperBob2 6 hours ago

Not in the future, not if you want to get paid.

OtomotO 15 hours ago

Because you will not be the only one struggling to get the hardware in the "unlikely" case the POTUS blurts out another fart.

alfiedotwtf 12 hours ago

> At any point the USA could come up with some arbitrary export controls

lol this already happened with Fable!

jrm4 17 hours ago

At current "proprietary inference company behavior," investing in local inference sounds like the exceedingly far more rational option.

Long term predictability ought to far outweigh a few more cycles of performance.

laserlight 7 hours ago

Don't forget the fact that you'll be questioned to death when you criticize the current generation of models, but somehow, when the new models arrive you'll be questioned to death if you don't find them better than the old ones.

trollbridge 10 hours ago

There are open models with groundbreaking innovations, like MiMo-2.5-Pro-UltraSpeed, which you simply can't get anywhere else (there is no other model with those capabilities that I can get at 1000 tokens/second).

realusername 17 hours ago

There's also a lot of benchmark trickery going on; it's becoming harder to see how much the latest models have really improved.

The top models also seem to have inconsistent performance depending on the time of day and how far we are from the next release.

bonesss 17 hours ago

I’m an LLM fan, but from an engineering perspective the idea of building atop services that palpably fluctuate in capacity, performance, and capability is nutty.

Even with minor automation I feel like I can watch OpenAI and Anthropic engineers fiddling in real time. Tuesday's behaviour changes by Thursday, 10AM's production isn't possible at 11:30AM. Nutty.

targafarian 16 hours ago

I chilled significantly on using Google for anything to do with business due to API (and offering) stability. (Still use Google for personal things.) But AI models seem orders of magnitude more fluid, so to my risk-averse eye, they're nothing I'd base my own business on.

senordevnyc 9 hours ago

Imagine having a business where you're at the mercy of the fluctuations in capacity, performance, and capability that your human employees display!

intothemild 12 hours ago

Since I started running my own inference server, I've had zero degradation that I didn't do myself. Basically the only time I see it get worse is if I drop one of the quants.

Which is what I suspect the providers are doing to fit more inference on the same amount of hardware over time.

Barbing 17 hours ago

Interesting, Claude might be doing better since I last checked:

https://marginlab.ai/trackers/claude-code-historical-perform...

There were at least a couple of these degradation trackers.

fsuts 14 hours ago

Agreed

4fffs 19 hours ago

Correct. Anything else is pure marketing and you have fallen for it.