> "we have trained a model to reaffirm what we believe Claude is thinking" ?
It's more like: "We trained a model to produce text that allows reconstruction of its activations, and that text happened to coincide with the results of other interpretability methods even after extensive training, whereas we expected it to devolve into an unintelligible mess."
They found something unexpected and useful. They report it while outlining limitations and ways to improve. That looks like fine research to me.