Summary from the authors:
- Different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space
- Injectivity is not accidental, but a structural property of language models
- Across billions of prompt pairs and several model sizes, we find no collisions: no two prompts are mapped to the same hidden states
- We introduce SipIt, an algorithm that exactly reconstructs the input from hidden states in guaranteed linear time (see the sketch after this summary).
- This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable.
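To make the reconstruction claim concrete, here is a minimal brute-force sketch of the idea, not the authors' SipIt implementation: assuming white-box access to the same model and its exact last-layer hidden states, the prompt can be recovered token by token by testing which vocabulary entry reproduces the observed state at each position. The model name `gpt2`, the batch size, and the helper names are illustrative assumptions.

```python
# Brute-force inversion sketch (assumed setup, not the paper's SipIt): recover a
# prompt token by token from its last-layer hidden states, using white-box access
# to the same model. Slow but illustrative; SipIt's contribution is a guaranteed
# linear-time version of this.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def hidden_states(ids):
    # Last-layer hidden states, shape (batch, seq_len, d_model).
    return model(input_ids=ids, output_hidden_states=True).hidden_states[-1]

@torch.no_grad()
def invert(target):
    # target: (seq_len, d_model) hidden states of an unknown prompt.
    recovered = []
    for pos in range(target.shape[0]):
        best_id, best_err = None, float("inf")
        for start in range(0, model.config.vocab_size, 512):  # batched vocab sweep
            cand = torch.arange(start, min(start + 512, model.config.vocab_size))
            prefix = torch.tensor(recovered, dtype=torch.long).repeat(len(cand), 1)
            ids = torch.cat([prefix, cand.unsqueeze(1)], dim=1)
            err = (hidden_states(ids)[:, pos] - target[pos]).norm(dim=-1)
            i = int(err.argmin())
            if float(err[i]) < best_err:
                best_err, best_id = float(err[i]), int(cand[i])
        recovered.append(best_id)  # injectivity => only the true token matches
    return tok.decode(recovered)

# Example: compute states for a "secret" prompt, then recover it from states alone.
secret_ids = tok("the cat sat", return_tensors="pt").input_ids
print(invert(hidden_states(secret_ids)[0]))
```

Because causal attention makes the state at position t depend only on tokens up to t, matching positions left to right recovers the prompt exactly whenever the mapping is injective.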
> - This impacts privacy, deletion, and compliance
Surely that's a stretch... Typically, the only thing that leaves a transformer is its output text, which cannot be used to recover the input.
If you claim, for example, that an input is not stored, but intermediate activations from an inference run _are_ retained, then this paper may suggest a means of recovering the input prompt.
remains recoverable... for less than a training run of compute. It's a lot, but it is doable.
Here's an output text: "Yes." Recover the exact input that led to it. (You can't, because the information in the hidden state is irreversibly collapsed when each token is sampled.)
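A toy way to see the collapse (assumed toy dimensions and a random stand-in projection, nothing from the paper): under greedy decoding, scaling the final hidden state changes the logits but not the argmax, so two distinct states emit the same token and the text alone cannot tell them apart.

```python
# Toy illustration: the step from hidden state to emitted token is many-to-one.
import torch

torch.manual_seed(0)
d_model, vocab = 16, 1000
unembed = torch.randn(d_model, vocab)  # stand-in for the output projection

h1 = torch.randn(d_model)
h2 = 2.0 * h1                          # a clearly different hidden state

tok1 = int((h1 @ unembed).argmax())
tok2 = int((h2 @ unembed).argmax())
print(tok1 == tok2)                    # True: same emitted token, different states
```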
The paper doesn't claim this is possible either: they prove the reversibility of the mapping between the input and the hidden state, not the output text. Or rather "near-reversibility", i.e. collisions are technically possible, but they would have to be very precisely engineered during model training and don't normally happen.
If you generate a lot of output text, you can approximate the hidden state.
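A toy sketch of that intuition, under assumptions the comment leaves implicit: you would need many samples of the very next token at a known temperature plus the model's unembedding matrix, and all shapes and names below are invented for illustration. Sampled token frequencies approximate the softmax output, and the final hidden state can then be estimated (up to the softmax shift) by least squares.

```python
# Toy sketch: approximate a final hidden state from many next-token samples,
# assuming the unembedding matrix is known. Not the paper's method.
import torch

torch.manual_seed(0)
d_model, vocab, n_samples = 8, 50, 200_000
unembed = torch.randn(vocab, d_model)           # stand-in unembedding matrix

h_true = torch.randn(d_model)
probs = torch.softmax(unembed @ h_true, dim=0)  # true next-token distribution

# "A lot of output text": many independent samples of the next token.
samples = torch.multinomial(probs, n_samples, replacement=True)
counts = torch.bincount(samples, minlength=vocab).float()
logits_est = torch.log(counts / n_samples + 1e-9)  # log-frequencies ≈ logits + const

# Solve (unembed) h ≈ logits_est in least squares; centering removes the constant.
A = unembed - unembed.mean(dim=0)
b = logits_est - logits_est.mean()
h_est = torch.linalg.lstsq(A, b.unsqueeze(1)).solution.squeeze()
print(torch.nn.functional.cosine_similarity(h_true, h_est, dim=0))  # close to 1
```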