It's worth mentioning that this is a different scenario from the reasoning models, though. A reasoning model uses the generated text to arrive at an answer; in a sense, it cannot lie until it gives that answer. The answer may then be justified by reasoning that was not the reasoning actually used. That part is the lie.

You can take this further when you consider DeepSeek-style reinforcement. While the reasoning text may appear to show the thought process in readable language, the model is trained to say whatever it needs to produce the right answer, and that may or may not be what the text means to an outside observer. In theory it could encode extra information in word lengths, or even evolve its own Turing-complete gobbledegook. The options available vary a lot in likelihood. Perhaps a more likely one is that some rarely used word has a poorly trained side effect that gives the context a kick in the right direction just before it was about to take the wrong fork. Kind of a SolidGoldMagikarp spanking.
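To make the "trained on the answer, not the reasoning" point concrete, here is a minimal sketch of an outcome-only reward in the spirit of DeepSeek-R1's rule-based rewards. The function name and the `<think>`/`<answer>` tag format are assumptions for illustration; the point is just that the reward inspects the extracted final answer and tag structure, so nothing pushes the text inside the think block toward meaning what it appears to mean.

```python
import re

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Score a rollout purely on its final answer and tag format.

    The reasoning inside <think>...</think> is never read, so RL is free
    to shape it into anything that helps produce correct answers, whether
    or not it reads as a faithful account to a human.
    """
    # Format reward: did the model use the expected tag structure?
    well_formed = bool(
        re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.S)
    )
    format_reward = 0.1 if well_formed else 0.0

    # Accuracy reward: compare only the extracted final answer.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if answer == gold_answer.strip() else 0.0

    return format_reward + accuracy_reward


# Two rollouts with the same final answer get the same reward, even though
# one "reasoning" trace is gibberish to a human reader.
readable = "<think>7 times 6 is 42.</think><answer>42</answer>"
gibberish = "<think>magikarp spline forty-twoness</think><answer>42</answer>"
print(outcome_reward(readable, "42"), outcome_reward(gibberish, "42"))  # 1.1 1.1
```

Under that kind of objective, a readable trace and an opaque one are worth exactly the same reward if they land on the same answer, which is why the legibility of the reasoning is a happy accident rather than something the training enforces.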