I think I'm misunderstanding the abstract, but are they trying to say that given an LLM output, they can tell me what the input was? Or given an output AND the intermediate layer weights? If it's the first option, I could use as inputs "Only respond with 'OK'" and "Please only respond with 'OK'", which gives two inputs producing the same output.
That's not what you get out of LLMs.
LLMs produce a distribution from which to sample the next token. Then there's a loop that samples the next token and feeds it back to the model until it samples an EndOfSequence token.
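Roughly, the loop looks like this (a toy sketch; fake_model here is just a stand-in that returns hard-coded distributions, not a real forward pass):

    import random

    EOS = "<eos>"

    def fake_model(tokens):
        # Stand-in for a real LLM forward pass: given the tokens so far,
        # return a probability distribution over the next token.
        if tokens and tokens[-1] == "OK":
            return {EOS: 0.998, "OK": 0.002}
        return {"OK": 0.997, EOS: 0.003}

    def sample(dist):
        # Draw one token according to the distribution.
        r, acc = random.random(), 0.0
        for token, p in dist.items():
            acc += p
            if r < acc:
                return token
        return token  # fall back to the last token if rounding leaves a gap

    def generate(prompt_tokens, max_len=20):
        tokens = list(prompt_tokens)
        for _ in range(max_len):
            next_token = sample(fake_model(tokens))
            if next_token == EOS:
                break
            tokens.append(next_token)
        return tokens

    print(generate(["Only", "respond", "with", "'OK'"]))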
In your example the two distributions might be {"OK": 0.997, EOS: 0.003} vs {"OK": 0.998, EOS: 0.002} and what I think the authors claim is that they can invert that distribution to find which input caused it.
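You can check this with any open model; here's a quick sketch using GPT-2 via Hugging Face transformers (just a small model I picked to make the point, not the model the paper uses, and not their inversion method):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def next_token_dist(prompt):
        # Probability distribution over the token that would follow the prompt.
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits[0, -1], dim=-1)

    p1 = next_token_dist("Only respond with 'OK'")
    p2 = next_token_dist("Please only respond with 'OK'")

    # Even if a chat model would answer "OK" to both prompts, the raw
    # distributions are different vectors, far apart compared to float noise.
    print((p1 - p2).abs().max())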
I don't know how they go beyond one iteration, as they surely can't deterministically invert the sampling.
Edit: reading the paper, I'm no longer sure about my statement below. The algorithm they introduce claims to do this: "We now show how this property can be used in practice to reconstruct the exact input prompt given hidden states at some layer [emp. mine]". It's not clear to me from the paper if this layer can also be the final output layer, or if it must be a hidden layer.
They claim that they can reverse the LLM (get the prompt from the LLM's output) by only knowing the output layer values; the intermediate layers remain hidden. So their claim is that indeed you shouldn't be able to do that, i.e. find two inputs that produce the same output (note that this claim applies to the numerical model outputs, not necessarily to the text a chat interface would show you, which goes through some randomization when sampling).
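For what it's worth, "hidden states at some layer" in that quote means the per-token activation vectors between layers, as opposed to the final output layer values (the logits). With transformers you can pull both out; again GPT-2 is just a stand-in model, and layer 6 is an arbitrary choice on my part:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Only respond with 'OK'", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states is a tuple: the embedding output plus one tensor per
    # layer, each of shape (batch, sequence_length, hidden_size).
    hidden_at_layer = out.hidden_states[6]   # activations after an arbitrary layer
    final_logits = out.logits                # the output layer, after the LM head
    print(hidden_at_layer.shape, final_logits.shape)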