The parent is off, you’re right. They may reason in any language, typically whatever the user’s language is, and you’ll see the reasoning directly with an open model like Deepseek.

Research only showed that thinking might be disconnected from the final output but in my experience they are very strongly correlated in recent models

> Research only showed that thinking might be disconnected from the final output

It is trivial to regularly spot obvious contradictions and inconsistencies if you read carefully. For example I've encountered traces that amounted to "I can deduce X, therefore Y, so that means Z" but then the model turns around and outputs "the answer is W because X". It's even been demonstrated that having the model output placeholder tokens or other gibberish instead of "thoughts" still improves performance. However the thinking traces can still be useful to the end user regardless.