That's not what you get out of LLMs.

LLMs produce a distribution from which to sample the next token. Then there's a loop that samples the next token and feeds it back to the model until it samples an EndOfSequence token.
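That loop can be sketched in a few lines. This is a toy stand-in, not a real LLM: `toy_model` is a made-up function returning a hard-coded distribution, just to show the sample-append-repeat structure.

```python
import random

EOS = "<eos>"

def toy_model(tokens):
    # Stand-in for an LLM forward pass: returns a probability
    # distribution over the next token given the context so far.
    # (A real model would run a neural network here.)
    if len(tokens) < 3:
        return {"OK": 0.7, EOS: 0.3}
    return {EOS: 1.0}

def sample(dist):
    # Draw one token at random, weighted by its probability.
    toks, probs = zip(*dist.items())
    return random.choices(toks, weights=probs, k=1)[0]

def generate(prompt):
    tokens = list(prompt)
    # The autoregressive loop: sample a token, append it, feed
    # the extended sequence back to the model, stop on EOS.
    while True:
        tok = sample(toy_model(tokens))
        if tok == EOS:
            return tokens
        tokens.append(tok)

print(generate(["Hello"]))
```

Note that the randomness lives entirely in `sample` — the model itself is deterministic, which matters for the inversion question below.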

In your example the two distributions might be {"OK": 0.997, EOS: 0.003} vs {"OK": 0.998, EOS: 0.002}, and what I think the authors claim is that they can invert that mapping to recover which input produced the distribution.
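A toy version of that idea (not the authors' actual method, just a likelihood comparison over the hypothetical numbers above): given an observed token, pick the candidate input whose distribution makes it most likely.

```python
import math

EOS = "<eos>"

# Hypothetical next-token distributions for two candidate inputs
# (numbers from the example above, not from the paper).
dists = {
    "input_A": {"OK": 0.997, EOS: 0.003},
    "input_B": {"OK": 0.998, EOS: 0.002},
}

def log_likelihood(dist, observed):
    # Log-probability the distribution assigns to the observed tokens.
    return sum(math.log(dist[tok]) for tok in observed)

# Pick the candidate input under which the observation is most likely.
# This is only a statistical guess: both inputs can emit either token.
observed = [EOS]
best = max(dists, key=lambda k: log_likelihood(dists[k], observed))
print(best)  # input_A, since it gives EOS probability 0.003 > 0.002
```

With distributions this close, a single observed token gives almost no signal; you'd need either many samples or access to the full distribution (the logits) rather than just the sampled token.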

I don't know how they go beyond one iteration, as they surely can't deterministically invert the sampling.