> it is clear that actual intelligence has plateaued significantly.

> Moving forward, the industry cannot continue to train bigger and bigger models since their intelligence not only plateaus but often will get worse

These are wild claims - why are we concluding that bigger models and more data = more hallucination? That’s actually the opposite of what’s been happening over the last couple years. Some models may still hallucinate more but they all hallucinate much less than the original 175B ChatGPT which was smaller and trained on (much) less data than anything current.

Edit: My mention of data comes from this quote:

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

My take on the current situation: it seems clear that the industry has seen that there is still a lot left to squeeze out of sub-1T models. But for that you do need more, high-quality data in the distribution which you want to unlock capabilities for.

> why are we concluding that bigger models and more data = more hallucination?

That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations.

The relevant quote for what you’re talking about would be:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer.

So there’s two separate claims: 1) bigger models have plateauing results 2) models trained on larger amounts of factual data have a higher hallucination rate

I’m pretty sure #1 is well known; I think OpenAI’s own research on scaling laws showed diminishing returns on parameter count and training data volume years ago. I don’t know what the support for #2 is besides the actual post contents.

I find these internet arguments talking about LLMs as if they are trained by reading the internet to be wild.

Yes, pretraining still exists. But for the past few years, pretraining by reading the internet is just the initial bootstrapping of LLM training. The RL training they get from bespoke training data, with very very different characteristics than what these armchair analyses claim, dominates these days.

I'd have to imagine there are wildly diminishing marginal returns to additional SFT/post-training passes.

There are a bounded number of (useful) derivations/combinations of Duff's device.

If Frontier Labs wish to reduce hallucinations on factual things, they will have to hire people (or the data providers will need to) to do fundamental research above and beyond what is available in extant literature and the web. I.e., if the LLMs want to lower precision error, they need to go out and actually find more expertise. If the Wikipedia page for Pompey lacks data, where are they going to get it from? How would they even _identify_ that the page has holes?

Yes, they can digitize more books but that is untrustworthy data - if there were enough eyeballs on a particular work, it would be in the internet. If it's not, they'd need to hire the experts themselves. They need expert reviewers in virtually every interesting topic, which fundamentally is an intractable problem, especially since things change all the time. Maybe even uninteresting topics, too?

I dunno, it doesn't seem to me that "more data" is the magic bullet here. Yeah, it will "help" but we're already on the flat part of the S-shaped curve.

My take from trying to understand this stuff is some sort of algorithmic improvement is necessary to get another step change in how well LLMs perform in this area. I could be wrong!

As a side gig, I write novel software that solves problems no existing software does, that existing LLMs have difficulty reproducing, purely for the purpose of existing as LLM training data.

There are journalists being hired to write Atlantic-worthy articles that exist only as LLM training data, because they're getting paid more than the Atlantic would pay them for it.

It's insane.

Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data.

The largest characteristic of all of this new data is that it is targeted at LLMs' weak points.

It's not just more data, it's custom tutorials built for what LLMs struggle at.

I'm not saying they are not trying - I'm saying we're inventing new problems faster than any Lab can:

1) Identify the gaps

2) Determine how to fix them

3) Implement a fix (especially if that fix is: identify and find experts)

4) And judge the result

How do they know [person] is an expert in [some field]? How do they find that person? How many experts are necessary to give the right information? How do we evaluate the results, especially if it's novel?

You can find a lot of people who disagree on many topics, and those turtles go all the way down.

I'm not in disagreement that your work will help reduce hallucinations and improve model performance! It is.

I predict (I hope I'm wrong!) that we're going to hit some asymptote that is not at 0% hallucinations (and I would even put a substantial nonzero probability that "overall" hallucination rate bottoms out at some minimum and then slowly grows because we just can't keep up with the new garbage we throw at it).

> How do they know [person] is an expert in [some field]? How do they find that person?

You just stumbled upon billion dollar businesses: Mercor, micro1, Scale AI, Surge AI, etc

> How do they know [person] is an expert in [some field]? How do they find that person?

They have a PhD from a top school, they are a licensed attorney, they are a licensed physician, a board certified cardiologist, etc.

They are constantly recruiting from these populations with well-paying side gigs.

> 4) And judge the result

That's what they pay the experts for. And to have experts review the other experts with peer review.

> You can find a lot of people who disagree on many topics, and those turtles go all the way down.

Which is why everything has to be well-calibrated and not just a hot take - a well reasoned opinion any expert would find fair.

No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks. Can they move the bar on the complexity of software LLMs do on their own? Can they get to a point where LLMs can begin to replace physicians? Financial advisors? Actuaries? etc.

> No one is really caring about hallucinations on point facts these days though, it is much more about complex reasoning tasks.

The boundary is pretty thin there though. E.g., Gemini recently told me that a certain paper claims that two frameworks are mathematically equivalent, while the paper shows the opposite, and yesterday Google's AI overview told me that no World Cup matches were scheduled for that day despite there being several of them. The model probably used complex reasoning to arrive at both (incorrect) answers, but superficially they look like basic errors of fact.

That is a great example of the kind of thing they're paying people to create as training data.

You write the prompt, and then write rubrics to judge the responses, and you found something the model failed at. Congratulations, you just earned $500, now do it again.

Ahhhh! the ever-present omniscient "they" of paranoia!

But be careful: they are watching you and they don't want you giving away their secrets!

1. How did you land the side gig? Mercor or a lesser-known brand?

2. What criteria do such vendors typically require?

I've done Mercor and other brands - the contracts move around, since the labs want the vendors to know they're just vendors and have to compete with each other. It seemed to be roughly resume and interview similar to getting hired at a senior role at FAANG or adjacent.

jmalicki says many things, among them being

"As a side gig, I write novel software that solves problems no existing software does,"

and

"Yes, they are hiring the experts themselves. To create new knowledge above and beyond what's on the internet. To be locked away as LLM training data."

More likely you're joking and/or paranoid!8-))

> I write novel software that solves problems no existing software does

This is actually really easy to do if you step out of web/gui/crud and into something where you won't find public code, most ever, because it's trade secret. For example, manufacturing.

There is also an endless fountain of things you come across every day and think "oh, wouldn't this complex solution to this low priority problem be cool", but no one ever implements it because it's too complex and the problem is low priority.

Anyone writing software for long enough has a long list of these things in the back of their head that are great fodder for LLM training data.

I wish our actual world wasn't an implausible scifi novel!

What kind of programs? Can you give an example of the tasks?

Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost-prohibitive.

Which is why rubrics as rewards are used.

Where do they get the bespoke training data from? And how much? I don’t really know anything about this.

> And how much?

Mercor, one of the larger vendors for contracting with experts to create bespoke data, says on their webpage they're paying $3M/day to their contractors for data.

At $3M/day that is about $1.1B a year from one vendor alone, so across vendors it is well into the billions of dollars a year for bespoke training data.

That's also ignoring the RLVR data labs can get from software - they can use the vibe coding sessions as training data as well without paying more.

They are just one of many.

Companies like Mercor sell data from human experts

Offhand, do you know what format that data is in? Is it a question and then a human answering that question? Mostly just curious as to what the training data consists of.

The most advanced training data is in the form of rubrics as rewards.

A human asks a question, then writes rubrics to judge the LLM's response, so rather than evaluating a specific response, those rubrics can live on as the LLM evolves and gives different answers. There are more complex variants as well, but that's the basic principle.
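
To make the basic principle concrete, here is a minimal sketch of a rubric turned into a reward. Everything in it is an assumption for illustration: the `RubricItem` type, the weights, and the keyword checks are hypothetical stand-ins, and in practice the per-item check would presumably be an LLM or human judge rather than string matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RubricItem:
    description: str
    weight: float
    # Hypothetical stand-in for a judge (LLM or human) deciding one criterion.
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Hypothetical rubric for a prompt like "Were any World Cup matches scheduled today?"
rubric = [
    RubricItem("Does not claim there were no matches", 3.0,
               lambda r: "no matches" not in r.lower()),
    RubricItem("Names a source for the schedule", 1.0,
               lambda r: "according to" in r.lower()),
]

good = "According to the official schedule, three matches were played today."
bad = "No matches were scheduled for today."

print(rubric_reward(good, rubric))
print(rubric_reward(bad, rubric))
```

The point of the design is that the rubric is independent of any one answer: the same list can score whatever a later model version says to the same prompt.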

https://arxiv.org/abs/2507.17746

Meta has reallocated a significant portion of their staff to generating this.

Meta also reportedly took a 49% nonvoting stake in Scale AI in June 2025 for about $14.3–$14.8 billion.

let me take down armchair analysis with my armchair analysis

> That’s not what your quotes said. They said bigger models = plateau in intelligence, nothing about more data or increased hallucinations ... I’m pretty sure #1 is well known

Well known in a multiverse branch where Fable was a dud?

No, well known in the current multiverse branch where we still occasionally use things like math and scientific analysis instead of people’s vibe checks and pelican SVGs.

Here’s the paper from OpenAI where Dario himself was a co-author: https://arxiv.org/pdf/2001.08361

> We have observed consistent scalings of language model log-likelihood loss with non-embedding parameter count N, dataset size D, and optimized training computation Cmin, as encapsulated in Equations (1.5) and (1.6). Conversely, we find very weak dependence on many architectural and optimization hyperparameters. Since scalings with N,D,Cmin are power-laws, there are diminishing returns with increasing scale.
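
For anyone who wants to see what "power law means diminishing returns" looks like numerically, here is a rough sketch. It assumes loss scales as L(N) proportional to N^(-alpha) with everything else held fixed, and the exponent 0.076 is the value I recall for parameter count from that paper, so treat it as an assumption rather than a quoted figure.

```python
# Assumed exponent for illustration; not quoted from the text above.
ALPHA_N = 0.076


def relative_loss(scale_factor: float, alpha: float = ALPHA_N) -> float:
    """Loss after multiplying parameter count by scale_factor, relative to
    the loss before, under the assumption L(N) ~ N ** -alpha."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor ** -alpha


for factor in (10, 100, 1000):
    print(f"{factor:>5}x params -> loss multiplied by {relative_loss(factor):.3f}")
```

Under this assumption each further 10x in parameters multiplies the loss by the same factor, so the absolute improvement per 10x keeps shrinking while the cost keeps growing. Note that this is about log-likelihood loss, not about hallucination rate or "intelligence" directly.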

> instead of people’s vibe checks and pelican SVGs.

Right, what happened is everyone went to Fable and asked it to make the very best bicycle pelican SVG, no mistakes. And Fable's bicycle pelican SVGs were such timeless masterpieces, we all instantly got AI psychosis. Happily, you were immune to this.

Yeah #2 may be incidental. Suppose one lab focused on bigger, and another on reinforcement training geared towards factual accuracy over sycophancy. You could easily wind up with a model from the second lab that is less powerful but more accurate.

I can’t prove it but I suspect there’s a bit of that going on.

I think one problem is that models can hallucinate often, a few times out of every 8 or 16 generations, and still get good results on benchmarks, most of which measure success out of top k. From a benchmark perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the 1 out of 16 you get is actually the successful one. I think this effect is easiest to see on Gemini Flash: it hallucinates like crazy, but it looks like that's by design to boost benchmarks.
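
The gap between the two perspectives is easy to put numbers on. Below is a small sketch using the usual combinatorial pass@k formula (the chance that at least one of k samples, drawn without replacement from n generations of which c are correct, is correct). Whether a given benchmark actually scores this way is an assumption here, not something I've checked.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, i.e. a correct sample is unavoidable.
    return 1.0 - comb(n - c, k) / comb(n, k)


# The scenario from the comment: 16 generations, only 1 of them correct.
benchmark_view = pass_at_k(n=16, c=1, k=16)  # any success out of 16 counts
user_view = pass_at_k(n=16, c=1, k=1)        # the user sees a single generation

print(f"pass@16 = {benchmark_view:.4f}, pass@1 = {user_view:.4f}")
```

So a model that fails 15 times out of 16 can look perfect on a top-16 metric while giving a single-shot user the right answer about 6% of the time.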

Yeah not only is it totally unsubstantiated, the benchmarks are getting less useful to really show the difference between these models. Big model smell is still a thing and GLM 5.2 while impressive is not Fable class.

Here is something I would like people to chew on. Perhaps the smartest researchers in the world across multiple labs know more about this than we do? Perhaps they are aware of issues like the data wall and diminishing marginal returns. And perhaps they are being honest when they tell you there is no wall?

Are the smartest researchers in the world out there saying there isn't a wall? I don't know of any people doing the actual R&D who frequently make outrageous claims.

https://x.com/polynoamial/status/2064210146558136827

I'd say that as an OpenAI employee he's kinda biased on the topic.

> A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling

I'm pretty sure it's mostly due to the training data quality. No idea why this never gets mentioned in those discussions.

It was obvious right from the get-go that the scaling law just enabled some abilities that were described by the underlying data, allowing the ANN to abstract them in the latent space.

Aren't hallucinations also heavily influenced by compute and memory capacity? I.e., companies can spend more time verifying results in an agentic format, spend more thinking tokens, and use less quantization. All of these heavily depend on compute and memory but are proven to decrease hallucinations.
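
As a toy illustration of why extra inference compute can buy accuracy, here is a sketch of majority voting over repeated samples. It is a deliberately simplified model and not how any lab actually verifies results: it assumes each sample is independently right with probability `p` and that answers are binary, both of which are assumptions made only to keep the arithmetic exact.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct,
    when each sample is correct with probability p (binary answers assumed)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 9, 27):
    print(f"n={n:>2}: p=0.7 -> {majority_correct(0.7, n):.3f}, "
          f"p=0.4 -> {majority_correct(0.4, n):.3f}")
```

In this toy model more samples help only when a single sample is already right more often than not; below that threshold, spending more compute on voting makes the final answer worse.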

Maybe GPT 5.5 is heavily nerfed due to lack of compute, memory, and energy?

I agree that it's far-fetched to conclude that bigger models have plateaued.

article specifically talks about this. deepseek spending significant test time with worse results than klm

Isn't that a case of overfitting? You have more data, but when you ask something that's not in that data, hallucinations happen.

>> it is clear that actual intelligence has plateaued significantly.

> These are wild claims -

Indeed, it is not clear there was any actual intelligence at any point.

A lot of generated content sure, sometimes even useful, but not necessarily anything more.

What is the definition of "actual intelligence"? How does it differ from regular intelligence and non-intelligence?

If someone can "design a custom asyncio event loop policy in that overrides get_child_watcher()", I would call that person intelligent. Does that mean that person is not actually intelligent but a mere content creation machine?

Traditionally, if you can create content, this shows you're intelligent. Created content is often called "intellectual" property. If a person can understand complex ideas and make connections between them, that is considered intellectual work. You have to be intelligent to do intellectual work. If a person can solve problems, this is also called intelligence. If the person can solve more complex problems, that person is said to have higher intelligence. This is often measured with a scale called IQ (Intelligence Quotient). There are other types of intelligence but they are basically variations of the same ability. Most definitions of intelligence also involve an ability to adapt to the environment.

Since intelligence is such a broad concept what exactly is the difference between the actual intelligence and AI, other than one is natural and the other one is artificial?

I understand being anti-AI because of the very real societal concerns. But ignoring what is in front of you is not a solution.

>These are wild claims - why are we concluding that bigger models and more data = more hallucination?

Because that's what they measured in this case.

How do we know GPT 5.5 is a bigger model?

Since it was created by _Open_AI surely it's really open and we can check, right? SCNR

My impression is that the fundamental issue is that LLMs attempt to extract reasoning (executive execution) from data (relationship between tokens).

There's an open question about whether this is theoretically possible, but it doesn't seem like it to me.

Human generated data is an effect of reasoning. Attempting to extract executive function from it is kind of like taking an anti-derivative of a function.

This has always seemed like the root of hallucinations to me. It sort of follows the parallels to lossy compression that a lot of people draw. You're extracting some characteristics by observing the relationship between tokens, and then trying to argue that those characteristics are equivalent to the thing that generated the original tokens.

Surely there's some sort of overlap there, but viewed that way, it seems obvious that more and more parameters and scaling won't solve the fundamental problem. There's only so much meaning you can extract from token relationships.

It's like trying to derive the shape of a flame from the smoke it produces.

The original intelligence that created those tokens was driven by a whole universe of inputs, from hormones to starlight to gravity, not to mention all of the strange things about consciousness and parapsychology that are so poorly understood.

The machines are definitely useful for a certain class of tasks - those that don't require much executive function, and the useful work mostly involves pattern matching.

The problem is, we seem to be mistaking effect for cause and imagining that these things have greater capabilities than they'll ever possess.

The investors that don't understand this are indeed going to learn a bitter lesson.

To train models to be smarter than they are, one needs examples and cases to train on, and once you get close to the top percentiles of human reasoning there is extremely little such material available.

You can create contrived logic problems, but they often turn into language games because English is not formal logic.

And you can train on "monty hall" style problems, but those too are language games that are intriguing to humans but obvious when framed slightly differently.

In other words, model trainers are fighting against the overwhelming mediocrity of the training corpus (all of the recorded human output from history).

As models improve, the next phase will be models co-designed with humans to overcome these limits. The way we use language and the process we use to problem solve (we currently call this "orchestration") will evolve as part of this. Meatspace metaphors map badly when we have massive context and don't need the same limits. How different is hallucination from extrapolation, etc.

Much of the skepticism and confusion about LLMs is no different than a person of average intelligence hearing a highly intelligent person explain something and considering the explanation gibberish, then arrogantly accusing the intelligent person of being unhelpful.

Much like dogs were domesticated from wolves to have traits that make them good around humans, LLMs will evolve around our limits, around our arrogance, around our aesthetic biases and prejudices. Intelligence and rationality is fundamentally not what most humans want from an LLM.

You mixed two random quotes from the article to create a strawman.

Of course you knew what you were doing, but it's disappointing that this was the top comment.

In cognitive science, it appears your brain has two modes of thinking:

- A very parallel type of computation that is fast and generally accurate and integrates hundreds of variables. It’s sometimes labeled as intuition or system 1 thinking.

- A much slower, step by step, analytical type, commonly linked with your pre-frontal cortex (one of the newest parts of the brain). Sometimes called system 2 thinking.

Maybe the way the universe works is that all computation more or less is one of those two types. In which case, an LLM alone is only the first part, which is often right but its results also cannot ever be proven.

An LLM is not thinking, assuming and relating it to thought and universal truths is nonsense.

We inflicted that on ourselves by picking the most confusing terminology ever. "No, reasoning isn't thinking. No, when the model says it thinks it's not actually thinking... No, an agent isn't actually a creature with agency... No, when we say it hallucinates it doesn't, like, actually hallucinate"

What were the alternatives?

Did you mean sentient?