My understanding is that it's neither impractical nor genuinely difficult; it's just that the "logging every step" approach produces explanations of their "reasoning" that are completely meaningless to us as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But in practice, the explanation would be phrased in terms of atomic configurations, which makes far less sense to us than "oh, this person likes red because they like roses".
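To make the "logging every step" point concrete, here's a minimal PyTorch sketch (the toy model, dimensions, and names are all mine, just for illustration): you really can capture every intermediate value with forward hooks, and the resulting trace is complete but humanly meaningless.

```python
import torch
import torch.nn as nn

# Toy stand-in for "the brain": a small hypothetical MLP classifier.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 2),  # e.g. logits for "likes red" vs "likes blue"
)

# "Log every step": capture every intermediate activation with forward hooks.
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if name:  # skip the top-level Sequential container itself
        module.register_forward_hook(make_hook(name))

x = torch.randn(1, 8)  # one "person"
logits = model(x)

# A complete, lossless trace of the computation...
for name, act in activations.items():
    print(name, act.flatten().tolist())
# ...but it reads as raw floats per layer: the atomic-configuration
# database of the analogy. Sufficient to reproduce the decision,
# useless as a human explanation.
```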
>It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain.
But this is an incredibly interesting problem!
Anthropic has done some great work on mechanistic interpretability that gets at the core of this problem.
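For instance, some of that work (e.g. "Towards Monosemanticity") trains sparse autoencoders on a model's activations to pull human-nameable features out of the raw numbers. Here's a minimal sketch of that dictionary-learning idea; this is my simplification with made-up dimensions, not Anthropic's actual code:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of a sparse autoencoder for interpretability:
    re-express activations as a sparse mix of many learned features."""

    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activations):
        # Encode raw activations into an overcomplete set of features...
        features = torch.relu(self.encoder(activations))
        # ...and require that a combination of them reconstructs
        # the original activation vector.
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(64, 512)  # a batch of logged model activations
features, recon = sae(acts)

# Training objective: reconstruct well while keeping features sparse,
# so each feature tends to fire for one interpretable concept
# ("roses", "the color red") rather than an atomic soup.
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
```

The sparsity penalty is the key design choice: it pushes each feature toward firing on one recognizable concept, which is exactly the bridge from the atom-level database back to "they like red because they like roses".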