The model learns to reason on its own. If you only reward correct results and not readable reasoning, it will find its own way to reason, and that way is not necessarily readable by a human. The chain may look like English, but the meaning of those words can be completely different (or even the opposite) for the model. Or it may look like a mix of languages, or plain gibberish - to you, but not to the model. Many models write one thing in the reasoning chain and something completely different in the reply.
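
As a toy sketch (the function names and scoring here are hypothetical, not any particular library's API), this is what an outcome-only reward looks like next to one that also grades the chain. In the first, nothing ever inspects the reasoning, so any chain that leads to the right answer scores the same. In the second, the chain is rewarded for looking readable, which pushes it to look plausible rather than to reflect what the model actually computed.

```python
# Hypothetical reward functions for RL fine-tuning of a reasoning model.

def outcome_only_reward(chain_of_thought: str, final_answer: str, reference: str) -> float:
    # The chain is never inspected; only the final answer is graded.
    # Any chain that helps produce the right answer is equally rewarded,
    # readable or not.
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def readability_aware_reward(chain_of_thought: str, final_answer: str,
                             reference: str, judge) -> float:
    # `judge` is an assumed callable (e.g. a judge model) scoring the chain
    # from 0 to 1 for readability. This nudges the chain toward human-readable
    # text, but the chain is now optimized to *look* good to the judge, not to
    # expose the actual computation.
    correctness = 1.0 if final_answer.strip() == reference.strip() else 0.0
    return correctness + 0.1 * judge(chain_of_thought)
```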

That's the nature of reinforcement learning and of any evolutionary process. That's also why the chain of thought in reasoning models is much less useful for debugging than it seems, even when the chain was guided by a reward model or by fine-tuning.