> If I am understanding this paper correctly, they are claiming that the model weights can be inverted in order to produce the original input text.
No, that is not the claim at all. They are instead claiming that given an LLM output that is a summary of chapter 18 of Mary Shelley's Frankenstein, you can tell that the input prompt that led to this output was "give me a summary of chapter 18 of Mary Shelley's Frankenstein". Of course, this relies on the exact wording: it means that if you had instead asked "give me a summary of chapter 18 of Frankenstein by Mary Shelley", you would necessarily have received a (slightly) different result.
Importantly, this needs to be understood as a claim about an LLM run with temperature = 0. Obviously, if the infra introduces randomness, this result no longer perfectly holds (but there may still be a way to recover it by running a more complex statistical analysis of the results, of course).
Edit: after reading the paper, their claim may be something more complex. I'm not sure whether their result applies to the final output, or whether it's restricted to knowing the internal state at some pre-output layer.
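To make the temperature = 0 point concrete, here is a minimal sketch of greedy decoding, using GPT-2 via Hugging Face transformers purely as a stand-in for whatever model the paper actually studies:

    # Minimal sketch: greedy (temperature = 0) decoding has no randomness,
    # so the same prompt always yields the same completion.
    # GPT-2 is only a stand-in model here.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = "Give me a summary of chapter 18 of Mary Shelley's Frankenstein."
    ids = tok(prompt, return_tensors="pt").input_ids

    with torch.no_grad():
        out1 = model.generate(ids, max_new_tokens=30, do_sample=False)
        out2 = model.generate(ids, max_new_tokens=30, do_sample=False)

    assert torch.equal(out1, out2)  # identical, token for token

With do_sample=False there is nothing to seed, so reproducing a completion only requires the same prompt and the same weights.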
> after reading the paper, their claim may be something more complex. I'm not sure whether their result applies to the final output, or whether it's restricted to knowing the internal state at some pre-output layer.
It's the internal state; that's what they mean by "hidden activations".
If the claim were just about the output it'd be easy to falsify. For example, the prompts "What color is the sky? Answer in one word." and "What color is the "B" in "ROYGBIV"? Answer in one word." should both result in the same output ("Blue") from any reasonable LLM.
Even that is not necessarily true. The output of the LLM is not "Blue". It is something like "the probability of 'Blue' is 0.98131", and it may well be 0.98132 for the other question. Certainly, they only talk about the internal state at one layer of the LLM; they don't need the values from the entire model.
That's exactly what the quoted answer is saying though?
The point I'm trying to make is this: the LLM output is a set of activations. Those are not "hidden" in any way: that is the plain result of running the LLM. Displaying the word "Blue" based on the LLM output is a separate step, one that the inference server performs, completely outside the scope of the LLM.
However, what's unclear to me from the paper is whether it's enough to have these activations from the final output layer, or whether you actually need activations from a hidden layer deeper in the model, which would require analyzing the internal state of the LLM.
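With an open-weights model you can at least look at both kinds of activations directly; a sketch with transformers, GPT-2 standing in for whatever the paper actually uses:

    # Sketch: per-layer hidden activations vs. the final logits.
    # GPT-2 is only a placeholder; the paper's setup may differ.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("What color is the sky? Answer in one word.", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)

    # hidden_states is a tuple: the embedding output plus one tensor per block.
    print(len(out.hidden_states))   # 13 for GPT-2 (embeddings + 12 blocks)
    deep = out.hidden_states[6]     # an activation deep inside the model
    last = out.hidden_states[-1]    # the final hidden state, fed to the LM head
    logits = out.logits             # what the LM head produces from `last`

The open question, as I read it, is whether the injectivity claim already holds for `last` (or the logits derived from it), or only for something like `deep`.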
There are also billions of possible Yes/No questions you can ask that won't get unique answers.
The LLM proper will never answer "yes" or "no". It will answer something like "Yes - 99.75%; No - 0.0007%; Blue - 0.0000007%; This - 0.000031%", and so on for every possible token. It is this complete response that is apparently unique.
With regular LLM interactions, the inference server then takes this output and actually picks one of these tokens using the probabilities. Obviously, that is a lossy and non-injective process.
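Roughly, that split looks like this (again a sketch, with GPT-2 standing in for the real model):

    # Sketch: the model's actual output is a distribution over the whole
    # vocabulary; collapsing it to a single token is a separate, lossy step.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("Is the sky blue? Answer yes or no:", return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]     # one score per vocabulary token
    probs = torch.softmax(logits, dim=-1)     # the ~50k-entry "complete response"

    greedy_id = int(probs.argmax())                 # what temperature = 0 keeps
    sampled_id = int(torch.multinomial(probs, 1))   # what a sampling server keeps
    print(tok.decode([greedy_id]), tok.decode([sampled_id]))

Two different prompts can easily collapse to the same single token; it is far less plausible that they yield the exact same 50k-entry vector.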
If the authors are correct (I'm not equipped to judge), then there must be additional output, thrown away before the user is presented with their yes/no, that can be used to recover the prompt.
It would be pretty cool if this were true. One could annotate results with this metadata as a way of citing sources.
Why do people not believe that LLMs are invertible when we had GPT-2 acting as a lossless text compressor for a demo? That's based on exploiting the invertibility of a model...
https://news.ycombinator.com/item?id=23618465 (The original website this links to is down, but it's proof that GPT-2 worked as a lossless text compressor.)
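The linked demo used arithmetic coding over GPT-2's token probabilities. The underlying idea can be illustrated with a much cruder, rank-based encoding; this is only a sketch of the principle, not the scheme the demo actually used:

    # Sketch: a crude, reversible encoding built on the model's predictions.
    # The real demo used arithmetic coding; this only shows why such schemes
    # are invertible: both sides run the same deterministic model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def encode(text):
        """Replace each token by its rank under the model's prediction."""
        ids = tok(tok.bos_token + text, return_tensors="pt").input_ids[0]
        ranks = []
        for i in range(1, len(ids)):
            with torch.no_grad():
                logits = model(ids[:i].unsqueeze(0)).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ranks.append((order == ids[i]).nonzero().item())
        return ranks  # well-predicted text turns into mostly small numbers

    def decode(ranks):
        """Invert encode() by re-running the same model on the same prefixes."""
        ids = tok(tok.bos_token, return_tensors="pt").input_ids[0]
        for r in ranks:
            with torch.no_grad():
                logits = model(ids.unsqueeze(0)).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids = torch.cat([ids, order[r].unsqueeze(0)])
        return tok.decode(ids[1:])

    text = "The quick brown fox jumps over the lazy dog."
    assert decode(encode(text)) == text

The round trip only works because both sides run the same model deterministically, which is the same property this thread is arguing about.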
I was under the impression that without also forcing the exact seed (which is randomly chosen and usually obfuscated), even providing the same exact prompt is unlikely to produce the same exact summary. In other words, under normal circumstances you can't even prove that a prompt and response are linked.
I'm under the impression that the seed only affects anything if temperature > 0. Or, more specifically, that the LLM, given a sequence of input tokens, deterministically outputs a probability for each possible next token, and the only source of randomness is in the procedure for selecting which of those next tokens to use; temperature = 0 means the procedure is "select the most likely one", with no randomness at all.
The seed and the actual randomness are a property of the inferencing infrastructure, not the LLM. The LLM outputs probabilities, essentially.
The paper is not claiming that you can take a dump of ChatGPT responses over the network and figure out what prompts were given. It's much more about an internal property of the LLM.
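Put differently, the boundary between the model and the inference server looks roughly like this (a sketch, not any particular server's code):

    # Sketch: the model's job ends at the logits; temperature and seed only
    # enter in the token-selection step the server runs afterwards.
    import torch

    def pick_next_token(logits, temperature, generator=None):
        if temperature == 0.0:
            return int(logits.argmax())    # greedy: the seed is irrelevant
        probs = torch.softmax(logits / temperature, dim=-1)
        return int(torch.multinomial(probs, 1, generator=generator))

    logits = torch.tensor([2.0, 1.0, 0.5])   # stand-in for real vocabulary scores
    print(pick_next_token(logits, 0.0))      # always 0
    g = torch.Generator().manual_seed(42)
    print(pick_next_token(logits, 1.0, generator=g))  # depends on the seed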