> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.

Wow! I already knew from previous research shared here that hallucinations are a fundamental problem for LLMs and likely to be unfixable, just like prompt injection, but I didn't realize the hallucination rates were so bad!

Everyone has been acting like the best models only hallucinate in edge cases, but even the best-performing one mentioned here - GLM-5.2 - has a hallucination rate of 28% when it doesn't "know" the answer to something.
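For anyone unsure how to read these numbers, here is a rough sketch of the arithmetic as the quote describes it. I'm assuming the rate is simply wrong answers divided by all missed questions (wrong plus "I don't know"); the benchmark's exact scoring may differ, and the function name and counts below are made up for illustration.

```python
def hallucination_rate(wrong: int, abstained: int) -> float:
    """Share of missed questions answered wrongly instead of declined.

    Assumed definition (taken from the quoted description, not from the
    benchmark's own documentation): wrong / (wrong + abstained).
    """
    missed = wrong + abstained
    if missed == 0:
        raise ValueError("no missed questions to score")
    return wrong / missed


# Hypothetical counts per 100 missed questions, chosen to match the quoted rates.
print(f"{hallucination_rate(94, 6):.0%}")   # 94 wrong, 6 abstentions
print(f"{hallucination_rate(28, 72):.0%}")  # 28 wrong, 72 abstentions
```

Note that under this reading the rate says nothing about how many questions the model got right; it only describes what happens on the ones it missed.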

That said, I think the title on the blog - "Bigger models are not the way" - is probably more fitting and touches on what should be even bigger news. If bigger models and bigger training sets have already stopped producing proportional returns, then it seems likely we are already near the top of the S-curve. That's huge news, considering the valuations of companies like OpenAI and xAI are largely based on the (absurd) idea of ever-increasing scaling of these models.

There is no concept of "knowledge" in LLM as it is on Wikipedia.

The question-tokens define the answer-tokens. That's it. The art lies in clustering the relevant weights together.

If it were that simple, we’d all be talking to these models in SQL, and yet this isn’t happening.

Circuits which emerge in the layers during training are much more complicated than a simple Bayesian relation.

Correct, LLMs are not ontologically capable of “knowing”. That is why I put “know” in quotes.

> There is no concept of "knowledge" in LLM as it is on Wikipedia.

There can be. You don't know whether the closed-source models are using something like DeepSeek's Engram.

The name "Engram" (n-gram) says it all - this is just another type of statistical word association, not a factual knowledge store.

While DeepSeek describe this as "knowledge lookup", what Engram is really trying to do is separate dynamic reasoning from static pattern recall, with the static patterns just being word-level n-gram statistics, not declarative facts/knowledge.

Just because 2-3 words often appear together in a sequence doesn't mean they represent a fact or truth (or falsehood) - it is just an n-gram statistical regularity.
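To make that concrete, here is a toy sketch of plain bigram counting. This is not Engram's actual mechanism, just an illustration of what "n-gram statistical regularity" means; the corpus and the `bigram_counts` helper are invented for the example.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # zip(tokens, tokens[1:]) yields each pair of neighbouring words.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up corpus: the false claim simply appears more often than the true one.
corpus = [
    "the moon is made of cheese",
    "the moon is made of cheese",
    "the moon is made of rock",
]

counts = bigram_counts(corpus)
print(counts[("of", "cheese")], counts[("of", "rock")])  # 2 1
```

The counts favour "of cheese" purely because it shows up more often. Nothing in the table records which continuation is true.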

If Engram helps reduce LLM GPU memory and FLOP requirements, then that is great, but it's not a solution for hallucination.

Agreed on the title, my bad! But yeah, I've had some truly terrible experiences using these "frontier" models in coding agents especially, where they just fabricate facts about codebases.