The force equation example is disturbing, but it's easy to prevent: disallow random decimal numbers in the formula, since those also suggest over-fitting to the data. It's immediately obvious that such numbers make the equation inelegant and therefore likely to be wrong. If you're going to use symbolic construction, be careful about which formulations you allow, and apply an appropriate penalty for complexity.
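To make the penalty idea concrete, here's a minimal sketch (hypothetical scoring function, not any particular symbolic-regression library): a candidate formula is scored by its data misfit plus a complexity term, with unexplained decimal constants costing extra.

```python
def complexity(expr_tokens):
    """Count tokens; arbitrary decimal constants count triple,
    since they are a telltale sign of over-fitting."""
    cost = 0
    for tok in expr_tokens:
        try:
            val = float(tok)
            # Integers are cheap; "magic" decimals are penalized harder.
            cost += 1 if val == int(val) else 3
        except ValueError:
            cost += 1  # variables and operators cost 1 each
    return cost

def score(mse, expr_tokens, lam=0.1):
    """Lower is better: data misfit plus a complexity penalty."""
    return mse + lam * complexity(expr_tokens)

# F = m * a (elegant) vs. F = 0.9837 * m * a + 0.0021 (over-fitted)
elegant = ["m", "*", "a"]
fitted  = ["0.9837", "*", "m", "*", "a", "+", "0.0021"]
print(score(0.010, elegant))  # small penalty: wins despite slightly worse fit
print(score(0.008, fitted))   # large penalty despite slightly better fit
```

With a penalty like this, the over-fitted formula loses even though its raw error is marginally lower.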
As for chess, although an LLM knows the rules of chess, it is not expected to have been trained on many optimal chess games. As such, is it fair to gauge its skill at chess, especially without showing it generated images of its candidate moves? Even if representational and training limitations were addressed, we know that LLMs are architecturally crippled in that they have no neural memory beyond their context. Imagine a next-gen LLM that, when presented with a chess puzzle, would first update its internal weights for optimal chess play by simulating a billion games, and only then return to the puzzle you gave it. Even with the current architecture, it could equivalently fork itself for the same purpose, producing a new trained model in effect, but the rushing human's desire to get the answer immediately gets in the way.
>As for chess, although an LLM knows the rules of chess, it is not expected to have been trained on many optimal chess games
Well, it's read every book ever written on chess, so you would expect it to be at least halfway decent.
GothamChess has a very popular chess YouTube channel, and he ran a chatbot championship in
* 2025 https://www.youtube.com/playlist?list=PLBRObSmbZluRddpWxbM_r...
* 2026 https://www.youtube.com/playlist?list=PLBRObSmbZluQwBIvxyiWf...
I recommend watching at least the last video in each playlist, which has the final between the best-playing bots.
My takeaway:
Most chatbots know openings very well; the problems start when one of them makes an unexpected (legal or illegal) move. Some models just copy moves from old games that make no sense in the current game, while other models continue playing (almost) correctly. In particular, ChatGPT was very bad in 2025 but very good in 2026.
(When a chatbot makes an illegal move, most of the time he just plays it as instructed. I think that's bad because it confuses the other chatbot, which may interpret the incorrect move differently. Say white moves the rook from a1 to a8, jumping over a pawn on a4: he may leave the pawn on the board, but black may assume the pawn on a4 has magically disappeared. Anyway, he's in show business, not the let's-get-a-Nobel business, and weird games are more fun to cast.)
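That a1-to-a8 move can be rejected mechanically. A minimal sketch (plain Python, no chess library; the board representation and helper name are my own) that checks whether a rook's path along a file is clear:

```python
def rook_file_move_is_legal(board, src, dst):
    """Check a rook move within one file: every intermediate square
    must be empty, and the destination must not hold a friendly piece.
    `board` maps squares like 'a4' to (color, piece_name) tuples."""
    if src[0] != dst[0]:
        raise ValueError("this sketch only handles moves along a file")
    lo, hi = sorted((int(src[1]), int(dst[1])))
    # Any piece strictly between source and destination blocks the rook.
    for rank in range(lo + 1, hi):
        if f"{src[0]}{rank}" in board:
            return False
    # A friendly piece on the destination square also makes it illegal.
    mover_color = board[src][0]
    return board.get(dst, (None,))[0] != mover_color

board = {
    "a1": ("white", "rook"),
    "a4": ("white", "pawn"),
    "a8": ("black", "rook"),
}
print(rook_file_move_is_legal(board, "a1", "a8"))  # False: blocked by the a4 pawn
print(rook_file_move_is_legal(board, "a1", "a3"))  # True: path is clear
```

A referee bot applying even this crude check could have flagged the move instead of leaving both players with inconsistent boards.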