I was worried about that a couple of years ago, when there was a lot of hope that deeper reasoning skills and hallucination avoidance would simply arrive as emergent properties of a large enough model.

More recently, it seems like that's not the case. Larger models sometimes even hallucinate more [0]. I think the entire sector is suffering from a Dunning-Kruger effect -- making an LLM is difficult, and they managed to get something incredible working in a much shorter timeframe than anyone really expected back in the early 2010s. But that led to overconfidence and hype, and I think there will be a much longer tail of future improvements than the industry would like to admit.

Even the more advanced reasoning models struggle to play a valid game of chess, much less win one, despite having plenty of chess games in their training data [1]. I think that, combined with the persistent trouble with hallucinations, hints at where the real limitations of the technology lie.

Hopefully LLMs will scare society into planning how to handle mass automation of thinking and logic, before a more powerful technology that can really do it arrives.

[0]: https://techcrunch.com/2025/04/18/openais-new-reasoning-ai-m...

[1]: https://dev.to/maximsaplin/can-llms-play-chess-ive-tested-13...

Really? I find newer models hallucinate less, and I think they still have room for improvement with better training.

I believe hallucinations are partly an artifact of imperfect model training, and thus can be ameliorated with better technique.

Yes, really!

Smaller models may hallucinate less: https://www.intel.com/content/www/us/en/developer/articles/t...

The RAG technique pairs a smaller model with an external knowledge base that's queried based on the prompt. It allows small models to outperform far larger ones on hallucination rates, at the cost of some performance. That is, to eliminate hallucinations, we should alter how the model works, not increase its scale: https://highlearningrate.substack.com/p/solving-hallucinatio....
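For illustration, here's a minimal sketch of the retrieve-then-generate pattern in Python. The toy word-overlap retriever, the tiny knowledge base, and the call_small_model placeholder are all assumptions made for the example; a real RAG setup would use embeddings and a vector store:

    # Minimal RAG sketch: retrieve relevant snippets from an external
    # knowledge base and prepend them, so the model answers from retrieved
    # context rather than from memorised (and possibly hallucinated) facts.
    # Toy word-overlap scoring stands in for a proper embedding search.

    def retrieve(query, knowledge_base, k=2):
        """Rank knowledge-base snippets by word overlap with the query."""
        query_words = set(query.lower().split())
        ranked = sorted(
            knowledge_base,
            key=lambda doc: len(query_words & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def build_grounded_prompt(query, knowledge_base):
        """Build a prompt that instructs the model to answer only from context."""
        context = "\n".join("- " + doc for doc in retrieve(query, knowledge_base))
        return (
            "Answer using ONLY the context below. "
            "If the context is insufficient, say you don't know.\n"
            "Context:\n" + context + "\n\nQuestion: " + query + "\nAnswer:"
        )

    knowledge_base = [
        "The Eiffel Tower is 330 metres tall.",
        "The Golden Gate Bridge opened in 1937.",
        "Mount Everest is 8,849 metres above sea level.",
    ]

    prompt = build_grounded_prompt("How tall is the Eiffel Tower?", knowledge_base)
    print(prompt)  # feed this to whatever small model you run, e.g. call_small_model(prompt)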

Pruned models, with fewer parameters, generally have a lower hallucination risk: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00695.... "Our analysis suggests that pruned models tend to generate summaries that have a greater lexical overlap with the source document, offering a possible explanation for the lower hallucination risk."
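To make "pruned" concrete, here is a small illustrative snippet using PyTorch's built-in pruning utilities; the single linear layer and the 30% ratio are arbitrary choices for the example, not the paper's setup:

    # Illustrative magnitude pruning with torch.nn.utils.prune: zero out the
    # 30% smallest-magnitude weights of a layer, leaving a sparser model.
    import torch
    from torch import nn
    from torch.nn.utils import prune

    layer = nn.Linear(512, 512)
    prune.l1_unstructured(layer, name="weight", amount=0.3)  # mask the smallest 30%
    prune.remove(layer, "weight")  # bake the mask into the weights

    sparsity = (layer.weight == 0).float().mean().item()
    print("fraction of zeroed weights: %.2f" % sparsity)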

At the same time, all of this should be contrasted with the "Bitter Lesson" (https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...). IMO, making a larger LLM does indeed produce a generally superior LLM: it produces more trained responses to a wider set of inputs. However, it does not change the fact that it's an LLM, so fundamental traits of LLMs - like hallucinations - remain.