There are some things that you still can't do with LLMs. For example, if you tried to learn chess by having the LLM play against you, you'd quickly find that it isn't able to track a series of moves for very long (usually 5-10 turns; the longest I've seen it last was 18) before it starts making illegal moves. It also generally accepts invalid moves from your side, so you'll never be corrected if you're wrong about how to use a certain piece.
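
If you do want to try it anyway, one workaround is to keep the rules outside the model: track the board yourself and reject illegal moves from either side. A rough sketch using python-chess, where ask_llm_for_move is a hypothetical stand-in for whatever API you'd actually call:

    # Rough sketch: python-chess enforces the rules, so illegal moves from
    # you or from the LLM get caught instead of silently accepted.
    # ask_llm_for_move is a hypothetical placeholder for your API call.
    import chess

    def ask_llm_for_move(fen: str) -> str:
        """Placeholder: send the position to your LLM, return its move in SAN."""
        raise NotImplementedError

    def try_move(board: chess.Board, san: str) -> bool:
        """Apply a SAN move if legal; return False otherwise."""
        try:
            board.push_san(san)
            return True
        except ValueError:  # python-chess raises ValueError subclasses for bad moves
            return False

    board = chess.Board()
    while not board.is_game_over():
        if not try_move(board, input("Your move (SAN): ")):
            print("Illegal move, try again.")
            continue
        reply = ask_llm_for_move(board.fen())
        if not try_move(board, reply):
            print(f"Model suggested an illegal move: {reply}")
            break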

Because it can't actually model these complex problems, it really requires awareness from the user regarding what questions should and shouldn't be asked. An LLM can probably tell you how a knight moves, or how to respond to the London System. It probably can't play a full game of chess with you, and will virtually never be able to advise you on the best move given the state of the board. It probably can give you information about big companies that are well-covered in its training data. It probably can't give you good information about most sub-$1b public companies. But, if you ask, it will give a confident answer.

They're a minefield for most people and use cases, because people aren't aware of how wrong they can be, and the errors take effort and knowledge to notice. It's like walking on a glacier and hoping your next step doesn't plunge through the snow and into a deep, hidden crevasse.

LLMs playing chess isn't a big deal. You can train a model on chess games and it will play at a decent Elo and very rarely make illegal moves (i.e. a 99.8% legal-move rate). There are a few such models around. I think post-training messes with chess ability and OpenAI et al. just don't really care about that. But LLMs can play chess just fine.

[0] https://arxiv.org/pdf/2403.15498v2

[1] https://github.com/adamkarvonen/chess_gpt_eval
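
If you're skeptical of the legal-move-rate figure, it's easy to measure something like it yourself: replay each model-generated game against a real rules engine and count how many proposed moves are accepted. A rough sketch (not the linked repo's actual code; it assumes you've already collected each game's SAN moves from the model):

    import chess

    def legal_move_rate(games: list[list[str]]) -> float:
        """Fraction of model-proposed SAN moves that are legal, scored per
        game up to and including the first illegal move."""
        legal = total = 0
        for moves in games:
            board = chess.Board()
            for san in moves:
                total += 1
                try:
                    board.push_san(san)
                    legal += 1
                except ValueError:
                    break  # stop scoring this game at the first illegal move
        return legal / total if total else 0.0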

Jeez, that arxiv paper invalidates my assumption that it can't model the game. Great read. Thank you for sharing.

Insane that the model actually does seem to internalize a representation of the state of the board -- rather than just pattern-matching against training games with similar move sequences.

...Makes me wish I could get back into a research lab. Been a while since I've stuck to reading a whole paper out of legitimate interest.

(Edit) At the same time, it's still worth noting the accuracy errors and the potential for illegal moves. That alone is enough to prevent LLMs from being applied to problem domains with severe consequences, like banking, security, medicine, law, etc.

> people aren't aware of how wrong they can be, and the errors take effort and knowledge to notice.

I have friends who are highly educated professionals (PhDs, MDs) who just assume that AI/LLMs make no mistakes.

They were shocked that it's possible for hallucinations to occur. I wonder if there's a halo effect where the perfect grammar, structure, and confidence of LLM output causes some users to assume expertise?

Computers are always touted as deterministic machines. You can't argue with a compiler, or Excel's formula editor.

AI, in all its glory, is seen as an extension of that: a deterministic thing, meticulously crafted to provide an undisputed truth, which can't make mistakes because computers are deterministic machines.

The idea of LLMs being networks of weights plus some randomness is an abstraction that is both too vague and too complicated for most people. Also, companies tend to say this part very quietly, so when people do read the fine print, they're shocked.
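
For anyone curious, the "randomness" part isn't mystical: it usually just means the reply is sampled from the model's output distribution rather than computed deterministically. A toy sketch with made-up numbers (not any particular model's code):

    # Toy sketch of temperature sampling: the model scores candidate next
    # tokens (logits), the scores become a probability distribution, and
    # the next token is drawn at random from it. Numbers are made up.
    import math
    import random

    def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
        scaled = [score / temperature for score in logits.values()]
        max_s = max(scaled)
        weights = [math.exp(s - max_s) for s in scaled]  # softmax, unnormalized
        return random.choices(list(logits), weights=weights, k=1)[0]

    # Two runs on the same input can give different answers.
    print(sample_next_token({"Paris": 5.1, "London": 3.2, "Berlin": 2.9}))
    print(sample_next_token({"Paris": 5.1, "London": 3.2, "Berlin": 2.9}))

Same weights and same prompt can still give different answers, which is exactly the part that tends to get said very quietly.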

> I wonder if there's a halo effect where the perfect grammar, structure, and confidence of LLM output causes some users to assume expertise?

I think it's just that LLMs model the generative probability distribution of token sequences so well that the one thing they are nearly infallible at is producing convincing results. Oftentimes the correct result is also the most convincing, but other times what seems most convincing to the LLM just happens to be most convincing to a human too, regardless of correctness.

https://en.wikipedia.org/wiki/ELIZA_effect

> In computer science, the ELIZA effect is a tendency to project human traits — such as experience, semantic comprehension or empathy — onto rudimentary computer programs having a textual interface. ELIZA was a symbolic AI chatbot developed in 1966 by Joseph Weizenbaum and imitating a psychotherapist. Many early users were convinced of ELIZA's intelligence and understanding, despite its basic text-processing approach and the explanations of its limitations.

It's complete bullshit. There is no way anyone ever thought anything was going on in ELIZA. There were people amazed that "someone could program that", but they had no illusions about what it was; it was obvious after 3 responses.

Don't be so sure. It was 1966, and even at a university, few people had any idea what a computer was capable of. Fast forward to 2025...and actually, few people have any idea what a computer is capable of.

If I wasn't familiar with the latest in computer tech, I would also assume LLMs never make mistakes, after hearing such excited praise for them over the last 3 years.

It is only in the last century or so that statistical methods were invented and applied. It is possible for many people to be very competent at what they are doing and at the same time be totally ignorant of statistics.

There are lies, statistics and goddamn hallucinations.

My experience, speaking over a scale of decades, is that most people, even very smart and well-educated ones, don't know a damn thing about how computers work and aren't interested in learning. What we're seeing now is just one unfortunate consequence of that.

(To be fair, in many cases, I'm not terribly interested in learning the details of their field.)

Have they never used it? The majority of the responses that I can verify are wrong. Sometimes outright nonsense, sometimes believable. Be it general knowledge or something where deeper expertise is required.

I worry that the way the models "speak" to users will cause them to drop their 'filters' about what to trust and what not to trust.

We have barely managed modern media literacy, and now we have machines that talk like 'trusted' face-to-face humans and can be "tuned" to suggest specific products or use whatever tone the owner/operator of the system wants.

> I have friends who are highly educated professionals (PhDs, MDs) who just assume that AI/LLMs make no mistakes.

Highly educated professionals in my experience are often very bad at applied epistemology -- they have no idea what they do and don't know.

It's super obvious even if you just try something like agent mode for coding: it starts off well but drifts off more and more. With various Claude models, I've even had it try to do totally irrelevant things, like re-indenting some code.

My favourite example is something that happens quite often, even with Opus: I ask it to change a piece of code, and it does. Then I ask it to write a test for that code, and it dutifully writes one. Next, I tell it to run the test, and of course the test fails. I ask it to fix the test; it tries, but the test fails again. We repeat this dance a couple of times, and then it seemingly forgets the original request entirely. It decides, "Oh, this test is failing because of that new code you added earlier. Let me fix that by removing the new code." Naturally, now the functionality is gone, so it confidently concludes, "Hey, since that feature isn't there anymore, let me remove the test too!"

"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." - Claude, probably

Yeah, the chess example is interesting. The best specialised AIs for chess are all clearly better than humans, but our best general AIs are barely able to play legal moves. The ceiling for AI is clearly much higher than current LLMs.

Large Language Models aren't general AIs. It's in the name.

They are being marketed as such…

> you'd quickly find that it isn't able to track a series of moves for very long (usually 5-10 turns; the longest I've seen it last was 18)

In chess, previous moves are irrelevant, and LLMs aren't good at filtering out irrelevant data [1]. For better performance, you should include only the relevant data in the context window: the current state of the board. (A rough sketch of what that looks like is below the link.)

[1] https://news.ycombinator.com/item?id=44724238
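
A minimal sketch of that idea, assuming python-chess and some chat API behind a hypothetical ask_llm call; the prompt carries only the current position as FEN, not the move history:

    # Minimal sketch: prompt with the current position (FEN) instead of
    # the full move history. ask_llm is a hypothetical placeholder for
    # your chat API call.
    import chess

    def ask_llm(prompt: str) -> str:
        """Placeholder: send the prompt to your LLM and return its reply."""
        raise NotImplementedError

    def build_prompt(board: chess.Board) -> str:
        side = "White" if board.turn == chess.WHITE else "Black"
        return (
            f"You are playing chess as {side}.\n"
            f"Current position (FEN): {board.fen()}\n"
            "Reply with exactly one legal move in SAN."
        )

    board = chess.Board()
    board.push_san("e4")
    print(build_prompt(board))  # the model only ever sees the current state

FEN already carries the few bits of history that still matter (castling rights, en passant, the move clocks), with threefold repetition being the one exception.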