There is a clarification tweet from the authors:
- we cannot extract training data from the model using our method
- LLMs are not injective w.r.t. the output text; that function is definitely non-injective, and collisions occur all the time
- for the same reasons, LLMs are not invertible from the output text
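The non-injectivity w.r.t. the output text is easy to see at the decoding step: collapsing a continuous logit vector to a discrete token is many-to-one, so distinct internal states can produce identical text. A toy sketch (plain NumPy, with made-up logit values; nothing to do with the paper's actual proof):

```python
import numpy as np

vocab = ["the", "cat", "sat"]

def decode(logits: np.ndarray) -> str:
    """Greedy decoding: collapse a continuous logit vector to one token."""
    return vocab[int(np.argmax(logits))]

# Two different continuous states (stand-ins for final hidden states /
# logits) that decode to the same token -- a collision in the text map.
logits_a = np.array([0.1, 2.3, 0.4])
logits_b = np.array([-1.0, 5.0, 4.9])

print(decode(logits_a), decode(logits_b))    # cat cat
assert decode(logits_a) == decode(logits_b)  # distinct inputs, same output
```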
From the abstract:
> First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective
I think the "continuous representations" here are the hidden states (the activations computed during an inference pass through the network, not the weights, which stay fixed), and that's the part that implies they aren't talking about the output text, which by its nature is not a continuous representation.
They could have called out in the abstract that they weren't referring to the output text, though.
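For anyone who wants to poke at the distinction, here's a minimal sketch assuming the Hugging Face transformers library and the public GPT-2 checkpoint (my choice of model and prompts, not the authors' setup): the per-position hidden states are the continuous representations the abstract is talking about, while the greedy continuation is the output text, which is free to collide across prompts.

```python
# Sketch only; nothing here verifies the paper's theorem -- it just
# shows the two maps being discussed side by side.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompts = ["The capital of France is", "France's capital city is"]

with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        # Continuous representation: last layer's hidden state at the final
        # position -- the kind of object the abstract calls injective in
        # the input sequence.
        hidden = model(ids, output_hidden_states=True).hidden_states[-1][0, -1]
        # Output text: a greedy continuation -- the map the authors say is
        # NOT injective, since different prompts can yield the same text.
        out = model.generate(ids, max_new_tokens=3, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        text = tok.decode(out[0, ids.shape[1]:])
        print(repr(p), "->", repr(text), "| hidden[:3] =", hidden[:3].tolist())
```

The two continuations may well coincide (both heading toward " Paris") while the two hidden-state vectors differ, which is exactly the asymmetry the tweet is pointing at.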