> After reading the paper, I think their claim may be something more complex. I'm not sure whether their result applies to the final output, or whether it's restricted to knowing the internal state at some pre-output layer.

It's the internal state; that's what they mean by "hidden activations".

If the claim were just about the output it'd be easy to falsify. For example, the prompts "What color is the sky? Answer in one word." and "What color is the 'B' in 'ROYGBIV'? Answer in one word." should both result in the same output ("Blue") from any reasonable LLM.

Even that is not necessarily true. The output of the LLM is not "Blue"; it is something like "the probability of 'Blue' is 0.98131", and it may well be 0.98132 for the other question. In any case, they only talk about the internal state of one layer of the LLM; they don't need the values from the entire model.
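To make that concrete, here's a minimal sketch using the Hugging Face transformers API ("gpt2" is just a stand-in for any causal model): the most likely next token may well coincide for the two prompts, while the full distributions almost certainly differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_dist(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # scores for every token in the vocabulary
    return torch.softmax(logits, dim=-1)   # the full probability distribution

p1 = next_token_dist("What color is the sky? Answer in one word.")
p2 = next_token_dist("What color is the 'B' in 'ROYGBIV'? Answer in one word.")

# The top token can coincide...
print(tok.decode([p1.argmax().item()]), tok.decode([p2.argmax().item()]))
# ...but the distributions as a whole will not.
print(torch.allclose(p1, p2))  # almost certainly False
```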

That's exactly what the quoted answer is saying though?

The point I'm trying to make is this: the LLM output is a set of activations. Those are not "hidden" in any way: that is the plain result of running the LLM. Displaying the word "Blue" based on the LLM output is a separate step, one that the inference server performs, completely outside the scope of the LLM.

However, what's unclear to me from the paper is whether it's enough to get these activations from the final output layer, or whether you actually need activations from a hidden layer deeper in the LLM, which does require access to the model's internal state.
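For what it's worth, both readings are easy to express with the transformers API (a sketch, with "gpt2" standing in for whatever model the paper used): the final-layer scores fall out of an ordinary forward pass, while the deeper activations require asking for the hidden states explicitly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("What color is the sky?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

final_logits = out.logits[0, -1]   # the "plain output": scores over the vocabulary
layers = out.hidden_states         # tuple: embeddings plus one tensor per layer
internal = layers[len(layers) // 2][0, -1]  # an activation from a layer halfway in
print(final_logits.shape, internal.shape)
```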

There are also billions of possible Yes/No questions you can ask that won't get unique answers.

The LLM proper will never answer "yes" or "no". It will answer something like "Yes - 99.75%; No - 0.0007%; Blue - 0.0000007%; This - 0.000031%", and so on for all possible tokens. It is this complete response that is apparently unique.

With regular LLM interactions, the inference server then takes this output and actually picks one of these responses using the probabilities. Obviously, that is a lossy and non-injective process.
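Here's a toy sketch of that lossy step (the numbers are invented, echoing the percentages above): two different distributions collapse to the same visible answer, so the answer alone cannot distinguish the prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Yes", "No", "Blue", "This"]

def sample(probs, temperature=1.0):
    # temperature sampling: rescale the log-probabilities, renormalize, draw one token
    logits = np.log(probs) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(vocab, p=p)

# Two distinct distributions with the same overwhelmingly likely answer:
p_a = np.array([0.9975, 0.0007, 0.0008, 0.0010])
p_b = np.array([0.9980, 0.0009, 0.0005, 0.0006])
print(sample(p_a), sample(p_b))  # almost always: Yes Yes
```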

If the authors are correct (I'm not equipped to judge), then there must be additional output that is thrown away before the user is presented with their yes/no, output which can be used to recover the prompt.

It would be pretty cool if this were true. One could annotate results with this metadata as a way of citing sources.