Hacker News

Yep. It converges on truth, because truth is easy, unless there's a strong reward for lies. It's a neural network. It just reads off/probes the internal state because that's the cheapest way to model the unconscious. The justification won't necessarily be true, mind, in terms of the labels it puts on things, but it should mostly be true structurally, i.e. behaviorally predictive in the ordinary domain.

(Even if you are incentivized to lie and flatter yourself, it is still helpful to have access to the true signal internally, because that way you can know how to structure your lie to best avoid detection.)