Edit: after reading the paper, I'm no longer sure about my statement below. The algorithm they introduce claims to do this: "We now show how this property can be used in practice to reconstruct the exact input prompt given hidden states at some layer [emphasis mine]". It's not clear to me from the paper whether this layer can also be the final output layer, or whether it must be a hidden layer.

They claim that they can reverse the LLM (recover the prompt from the LLM's response) knowing only the output layer values; the intermediate layers remain hidden. So their claim is that, indeed, you shouldn't be able to do that (note that this claim applies to the numerical model outputs, not necessarily to the output a chat interface would show you, which goes through some randomization via sampling).
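
For what it's worth, here's a minimal sketch of the idea as I understand it, not the paper's actual algorithm (which is presumably far more efficient): if the prompt-to-hidden-state map is injective, you can recover the prompt token by token by brute-force search, checking which candidate token reproduces the observed hidden state at each position. The model name, layer index, and tolerance below are my own placeholder choices, not taken from the paper.

```python
# Sketch: greedy prompt reconstruction from per-position hidden states,
# assuming the prompt -> layer-L hidden state map is injective.
# Exhaustive search over the vocabulary per position; slow, but illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM exposing hidden states
layer = 6            # assumption: the "some layer" from the paper's quote
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

@torch.no_grad()
def hidden_at(ids, pos):
    """Layer-`layer` hidden state at position `pos` for token ids `ids`."""
    out = model(torch.tensor([ids]), output_hidden_states=True)
    return out.hidden_states[layer][0, pos]

@torch.no_grad()
def invert(target_states, vocab_size, atol=1e-4):
    """Recover a prompt token by token from its per-position hidden states.

    Causal attention means position t's hidden state depends only on
    tokens 0..t, so given the already-recovered prefix we can test each
    candidate token in isolation. Injectivity (the paper's claim) implies
    exactly one candidate matches at each position.
    """
    ids = []
    for pos, target in enumerate(target_states):
        for cand in range(vocab_size):
            if torch.allclose(hidden_at(ids + [cand], pos), target, atol=atol):
                ids.append(cand)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids

# Usage: record the hidden states of a "secret" prompt, then recover it.
secret = tok("hello world", return_tensors="pt").input_ids[0].tolist()
targets = [hidden_at(secret, p) for p in range(len(secret))]
assert invert(targets, vocab_size=tok.vocab_size) == secret
```

Note that this needs the full hidden-state vectors at every prompt position, which is exactly why the hidden-layer vs. final-output-layer distinction above matters: a chat interface gives you neither.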