This is specifically a consumer-model (or specifically ChatGPT) issue. E.g., IME Codex does not do this and will just tell you when you're missing something or are somehow wrong, and Gemini does this weird thing where it tells you you're a genius and then immediately starts correcting everything you said.

Sycophancy is just one aspect of the problems I mentioned, though. Another huge one is hallucination, and one that is actually far worse than I thought:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.

https://arrowtsx.dev/bigger-models/

I think even a 5% hallucination rate would be terrible for a teacher, who should generally be comfortable with saying "I don't know off the top of my head but here is how to find resources to answer your question".
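To make the arithmetic in the quote concrete, here is a minimal sketch of the metric as the quote describes it: of the questions the model did not get right, the share where it gave a wrong answer instead of saying it didn't know. This is my reading of the quote, not the benchmark's official scoring code; the function name and the counts are made up for illustration, and it ignores any partial-credit category the real benchmark may have.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of non-correct responses that were wrong answers
    rather than an explicit "I don't know".

    Assumption: every non-correct response is either a wrong answer
    or an abstention (no partial-credit bucket).
    """
    not_correct = incorrect + abstained
    if not_correct == 0:
        # The model got everything right, so there was nothing to hallucinate on.
        return 0.0
    return incorrect / not_correct


# Hypothetical counts picked to match the 94% figure from the quote:
# out of 100 questions the model could not answer correctly,
# it guessed wrong on 94 and admitted not knowing on 6.
rate = hallucination_rate(incorrect=94, abstained=6)
print(f"{rate:.0%}")  # 94%
```

Note that under this reading the rate says nothing about how often the model is right overall, only about what it does when it isn't.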

---

So, just to drive the point home, Codex has an 86.9% hallucination rate on the AA-Omniscience benchmark in this index: https://benchlm.ai/models/gpt-5-3-codex. If you ask it something that wasn't sufficiently covered in its training data, it will confidently make up an answer nearly 87% of the time.

While you might think it is happy to correct you when you are wrong, you don't know that for sure since you don't know when you're wrong. Codex may have been happily agreeing with you about things you had completely backwards.

Except I generally do know when I'm wrong because I'm working in a domain I am familiar with, and it will often create experiments on the fly unprompted (well, prompted, but generically in AGENTS.md) to check itself. My experience actually using it for software is that it almost never makes up answers. The answer for hallucinations is fairly simple: give it facts and tools to ground itself.

> Except I generally do know when I'm wrong because I'm working in a domain I am familiar with, and … My experience actually using it for software is that it almost never makes up answers.

Yes, I am certain that it feels that way. However, empirical testing carries a lot more weight than anecdotes.

> The answer for hallucinations is fairly simple: give it facts and tools to ground itself.

The entire danger here is that it hallucinates when you don’t know the ground facts. After all, you don’t know what you don’t know.

> Gemini does this weird thing where it tells you you're a genius and then immediately starts correcting everything you said.

That's a great way to get you to listen because your guard is down. Imagine if it told you you were an idiot and then corrected you.