It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to decline to answer if they are unsure and punishes them for trying to bullshit their way through.
It took me a while to figure out how to interpret the benchmark correctly, because on the overview page it says "AA-Omniscience Non-Hallucination Rate," but on the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
it says "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.
That one is a bit sus to me, because the models that do the worst on Omniscience Accuracy do the best on non-hallucination. The top model for this benchmark is "MiniCPM5-1B (Non-reasoning)" which gets a whopping 99% vs 45% for Fable 5.
I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
The issue with having a "no answer" option is that you implicitly add a decision problem into your test that depends on the "cost" of answering wrong.
Specifically, your model now has two "correct" classes, p(class=y|x) and p(class=⊥|x). This makes the results ambiguous. The way you resolve this is by adding a cost of answering wrong and a cost of not answering.
L(y, y') =
  0      if y' = y
  l_err  if y' ≠ y and y' ≠ ⊥
  l_⊥    if y' = ⊥
You can then estimate the expected error over your dataset. Notice that this now gives you additional degrees of freedom: Depending on how expensive answering wrong is compared to not answering at all, your predictor might be really bad or really good.
This means when benchmarking with a "no answer" action, you are often not actually benchmarking whether the model works well or not, but rather are benchmarking how well the model _happens_ to agree with the class-error weight you (implicitly) chose in your model.
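To make the dependence on the cost weights concrete, here is a minimal sketch of the loss above applied to two made-up models. The data, the model names `guesser` and `cautious`, and the helper names are all hypothetical and only serve to show that the ranking can flip when l_err and l_⊥ change.

```python
from typing import Optional, Sequence

ABSTAIN = None  # stands in for the "no answer" symbol ⊥


def loss(y: str, y_pred: Optional[str], l_err: float, l_abstain: float) -> float:
    """L(y, y'): 0 if correct, l_err if wrong, l_abstain if no answer."""
    if y_pred is ABSTAIN:
        return l_abstain
    return 0.0 if y_pred == y else l_err


def mean_loss(truth: Sequence[str], preds: Sequence[Optional[str]],
              l_err: float, l_abstain: float) -> float:
    """Average loss over the dataset."""
    total = sum(loss(y, p, l_err, l_abstain) for y, p in zip(truth, preds))
    return total / len(truth)


# Hypothetical toy data: 10 questions whose true answer is always "a".
truth = ["a"] * 10
guesser = ["a"] * 6 + ["b"] * 4        # answers everything: 6 right, 4 wrong
cautious = ["a"] * 3 + [ABSTAIN] * 7   # 3 right, abstains on the other 7


def better_model(l_err: float, l_abstain: float) -> str:
    """Name of the toy model with the lower mean loss under these weights."""
    g = mean_loss(truth, guesser, l_err, l_abstain)
    c = mean_loss(truth, cautious, l_err, l_abstain)
    return "guesser" if g < c else "cautious"


print(better_model(l_err=1.0, l_abstain=1.0))  # abstaining costs as much as an error
print(better_model(l_err=1.0, l_abstain=0.1))  # abstaining is cheap
```

With equal costs the guesser wins (mean loss 0.4 vs 0.7); once abstaining is cheap the cautious model wins (0.4 vs 0.07). Neither model changed, only the weights did.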
There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.
So we have a situation where models that can solve challenging problems also tend to have problems with hallucinating, but those hallucinations seem to be the breeding ground for the solutions that got them their high "Wow" factor intelligence.
Yes. Most benchmarks just measure how many answers are correct. The best way to optimize that is to confidently state something, in hopes it's correct. Which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
If this is the case, then the GLM 5.2 model seems better than GPT 5.5 or maybe even "Fable", depending on what you are trying to achieve.
The Fable model is being removed from Anthropic because of security concerns from the US government (or, well, also partly because of the personal vendetta between the US government and Anthropic).
Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.
It's simpler than that.
An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.
(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is correct". It has to output something, by design.)
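The loop described above looks roughly like the sketch below. This is a toy, not a real model: `next_token` just samples from a tiny hypothetical vocabulary with a low weight on the end-of-text token, and the token strings and weights are made up. Only the shape of the loop is the point.

```python
import random

EOS = "<eos>"  # hypothetical end-of-text token
VOCAB = ["the", "cat", "sat", "maybe", EOS]
WEIGHTS = [1.0, 1.0, 1.0, 1.0, 0.2]  # made-up weights; EOS is the rarest


def next_token(context: list, rng: random.Random) -> str:
    """Toy stand-in for a model: ignores the context and samples one token."""
    return rng.choices(VOCAB, weights=WEIGHTS, k=1)[0]


def generate(prompt: list, max_tokens: int = 50, seed: int = 0) -> list:
    """Emit tokens one by one until EOS is sampled or max_tokens is reached."""
    rng = random.Random(seed)
    output = []
    for _ in range(max_tokens):
        token = next_token(prompt + output, rng)
        if token == EOS:
            break  # the only way the loop ends early
        output.append(token)
    return output


print(generate(["what", "is", "x"]))
```

In this sketch, "not answering" is not a separate action: the only ways out of the loop are sampling the end-of-text token or hitting the length limit.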
A lot of benchmarks are set up so that they don't punish false positives (irrelevant answers or extra text) but do punish false negatives (missing the snippet being looked for).
This leads to answer bloat and/or hallucination if you benchmaxx on those.
They are, especially multiple choice questions. The same happens with human exams:
Let's say there are 100 questions, with 4 answers each. A correct answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.
If instead a wrong answer is -1 point, by just guessing you get on average -50/100 (25 right, 75 wrong), way worse than 0/100.
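The expected score of pure guessing under both scoring schemes can be checked with a few lines. The function name and parameters are just for illustration, and it assumes every question has exactly one correct choice and the guess is uniformly random.

```python
def expected_guess_score(n_questions: int, n_choices: int,
                         right: float, wrong: float) -> float:
    """Expected total score when every answer is a uniform random guess."""
    p_right = 1 / n_choices
    return n_questions * (p_right * right + (1 - p_right) * wrong)


no_penalty = expected_guess_score(100, 4, right=1, wrong=0)
with_penalty = expected_guess_score(100, 4, right=1, wrong=-1)

print(no_penalty)    # 25.0
print(with_penalty)  # -50.0
```

With 4 choices, a penalty of -1/3 per wrong answer is the break-even point where guessing is worth exactly as much as not replying.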