Check my understanding & follow-up Qs:
An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.
Architecture.:
Model being analyzed (M): >|||||>
Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>
Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>
The AV, AR models are initialized using supervised learning on a summarization task. The assumption being that model thoughts are similar to context summary.The AR is trained on a simple reconstruction loss.
The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).
- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?
- They test this on a layer 2/3 deep into the models. I wonder how shallow and deep abstractions affect thought verbalization?