Hacker News

Check my understanding & follow-up Qs:

An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.

Architecture.:

    Model being analyzed (M): >|||||>  
    Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>  
    Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>

The AV, AR models are initialized using supervised learning on a summarization task. The assumption being that model thoughts are similar to context summary.

The AR is trained on a simple reconstruction loss.

The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).

- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?

- They test this on a layer 2/3 deep into the models. I wonder how shallow and deep abstractions affect thought verbalization?