I remember hearing an argument once that said LLMs must be capable of learning abstract ideas because the size of their weights (typically GBs) is so much smaller than the size of their training data (typically TBs or PBs). So either the models are throwing away most of the training data, they are compressing the data beyond the known limits, or they are abstracting the data into more efficient forms. That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.

I am sure I am not understanding this paper correctly, because it sounds like they are claiming that model weights can be used to reproduce the original input text, which would represent an extraordinary level of text compression.

> If I am understanding this paper correctly, they are claiming that the model weights can be inverted in order to produce the original input text.

No, that is not the claim at all. They are instead claiming that given an LLM output that is a summary of chapter 18 of Mary Shelley's Frankenstein, you can tell that the input prompt that led to this output was "give me a summary of chapter 18 of Mary Shelley's Frankenstein". Of course, this relies on the exact wording: for this to be true, it means that if you had asked "give me a summary of chapter 18 of Frankenstein by Mary Shelley", you would necessarily receive a (slightly) different result.

Importantly, this needs to be understood as a claim about an LLM run with temperature = 0. Obviously, if the infra introduces randomness, this result no longer perfectly holds (but there may still be a way to recover it by running a more complex statistical analysis of the results, of course).
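
To make that concrete, here is a minimal sketch (assuming the Hugging Face transformers library and the small public gpt2 checkpoint, purely for illustration): with no sampling in the loop, a CPU forward pass is deterministic, and even a small rewording of the prompt changes the next-token distribution the model produces.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def next_token_logits(prompt):
        # Raw scores for every possible next token, given the prompt.
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        return out.logits[0, -1]

    a = next_token_logits("give me a summary of chapter 18 of Mary Shelley's Frankenstein")
    b = next_token_logits("give me a summary of chapter 18 of Frankenstein by Mary Shelley")
    print(torch.equal(a, b))   # False: reworded prompt, different distribution
    print(torch.equal(a, next_token_logits(
        "give me a summary of chapter 18 of Mary Shelley's Frankenstein")))  # True: deterministic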

Edit: after reading the paper, their claim may be something more complex. I'm not sure whether their result applies to the final output, or whether it's restricted to knowing the internal state at some pre-output layer.

> after reading the paper, their claim may be something more complex. I'm not sure whether their result applies to the final output, or whether it's restricted to knowing the internal state at some pre-output layer.

It's the internal state; that's what they mean by "hidden activations".

If the claim were just about the output it'd be easy to falsify. For example, the prompts "What color is the sky? Answer in one word." and "What color is the "B" in "ROYGBIV"? Answer in one word." should both result in the same output ("Blue") from any reasonable LLM.

Even that is not necessarily true. The output of the LLM is not "Blue". It is something like "probability of 'Blue' is 0.98131". And it may well be 0.98132 for the other question. Granted, they only talk about the internal state of one layer of the LLM; they don't need the values of the entire network.

That's exactly what the quoted answer is saying though?

The point I'm trying to make is this: the LLM output is a set of activations. Those are not "hidden" in any way: that is the plain result of running the LLM. Displaying the word "Blue" based on the LLM output is a separate step, one that the inference server performs, completely outside the scope of the LLM.

However, what's unclear to me from the paper is whether it's enough to get these activations from the final output layer, or whether you actually need some internal activations from a hidden layer deeper in the LLM, which does require analyzing the internal state of the LLM.

There are also billions of possible Yes/No questions you can ask that won't get unique answers.

The LLM proper will never answer "yes" or "no". It will answer something like "Yes - 99.75%; No - 0.0007%; Blue - 0.0000007%; This - 0.000031%", etc., for all possible tokens. It is this complete response that is apparently unique.

With regular LLM interactions, the inference server then takes this output and actually picks one of these responses using the probabilities. Obviously, that is a lossy and non-injective process.
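
A rough sketch of that separation, with made-up numbers (none of this comes from a real model): the model's job ends at the probability vector; collapsing it to one visible word is a separate, lossy step performed by the inference server.

    import numpy as np

    # Hypothetical next-token distribution; the values are invented for illustration.
    vocab = ["Yes", "No", "Blue", "This"]
    probs = np.array([0.9975, 0.0007, 0.0000007, 0.000031])
    probs = probs / probs.sum()                    # renormalise the made-up numbers

    greedy = vocab[int(np.argmax(probs))]          # temperature = 0: always "Yes"
    sampled = np.random.choice(vocab, p=probs)     # temperature > 0: usually "Yes"

    # Either way, the full distribution is collapsed to a single token; many
    # different distributions map to the same visible answer, hence non-injective.
    print(greedy, sampled)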

If the authors are correct (I'm not equipped to judge) then there must be additional output which is thrown away before the user is presented with their yes/no, which can be used to recover the prompt.

It would be pretty cool if this were true. One could annotate results with this metadata as a way of citing sources.

Why do people not believe that LLMs are invertible when we had GPT-2 acting as a lossless text compressor for a demo? That's based on exploiting the invertibility of a model...

https://news.ycombinator.com/item?id=23618465 (The original website this links to is down, but it's proof that GPT-2 worked as a lossless text compressor.)

I was under the impression that without also forcing the exact seed (which is randomly chosen and usually obfuscated), even providing the same exact prompt is unlikely to provide the same exact summary. In other words, under normal circumstances you can't even prove that a prompt and response are linked.

I'm under the impression that the seed only affects anything if temperature > 0. Or more specifically, that the LLM, given a sequence of input tokens, deterministically outputs the probability for each possible next token, and the only source of randomness is in the procedure for selecting which of those next tokens to use. And that temperature = 0 means the procedure is "select the most likely one", with no randomness at all.

The seed and the actual randomness is a property of the inferencing infrastructure, not the LLM. The LLM outputs probabilities, essentially.
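
A sketch of where the seed enters, using made-up logits (this mirrors what a typical sampler does, not any specific implementation): at temperature 0 the choice is a plain argmax and the seed is never consulted; above 0, the seed is what makes the draw reproducible.

    import numpy as np

    logits = np.array([3.1, 1.2, -0.5, 0.4])        # invented next-token scores

    def pick_next(logits, temperature, seed=None):
        if temperature == 0:
            return int(np.argmax(logits))           # deterministic, seed ignored
        rng = np.random.default_rng(seed)
        z = (logits - logits.max()) / temperature   # softmax at this temperature
        p = np.exp(z)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    print(pick_next(logits, 0.0))                   # same index every run
    print(pick_next(logits, 0.8, seed=42))          # reproducible with a fixed seed
    print(pick_next(logits, 0.8))                   # may vary run to run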

The paper is not claiming that you can take a dump of ChatGPT responses over the network and figure out what prompts were given. It's much more about a property of the LLM internally.

There is a clarification tweet from the authors:

- we cannot extract training data from the model using our method

- LLMs are not injective w.r.t. the output text, that function is definitely non-injective and collisions occur all the time

- for the same reasons, LLMs are not invertible from the output text

https://x.com/GladiaLab/status/1983812121713418606

From the abstract:

> First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective

I think the "continuous representation" (perhaps the values of the activations during an inference pass through the network) is the part that implies they aren't talking about the output text, which by its nature is not a continuous representation.

They could have called out that they weren't referring to the output text in the abstract though.

Clarification [0] by the authors. In short: no, you can't.

[0] https://x.com/GladiaLab/status/1983812121713418606

Thanks - seems like I'm not the only one who jumped to the wrong conclusion.

I also thought this when I read the abstract. input=prompt output=response does make more sense.

The input isn't the training data, the input is the prompt.

Ah ok, for some reason that wasn't clear for me.

> they are compressing the data beyond the known limits, or they are abstracting the data into more efficient forms.

I would argue that these are two ways of saying the same thing.

Compression is literally equivalent to understanding.

If we use gzip to compress a calculus textbook does that mean that gzip understands calculus?

Finding repetitions and acting on them accordingly could be considered a very basic form of understanding.

To a small degree, yes. GZIP knows that some patterns are more common in text than others - that understanding allows it to compress the data.
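
A tiny illustration, using Python's zlib (the same DEFLATE scheme gzip uses): text with repeated structure compresses a lot, while bytes with no structure to "understand" barely compress at all.

    import os, zlib

    patterned = b"the derivative of x squared is two x. " * 64
    noise = os.urandom(len(patterned))              # same length, no structure

    print(len(patterned), len(zlib.compress(patterned)))   # shrinks dramatically
    print(len(noise), len(zlib.compress(noise)))           # roughly the original size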

But that's a poor example of what I'm trying to convey. Instead consider plotting the course of celestial bodies. If you don't understand, you must record all the individual positions. But if you do, say, understand gravity, a whole new level of compression is possible.

I'm not sure if I would call it "abstracting."

Imagine that you have a spreadsheet that runs from the beginning of the universe to its end. It contains two columns: the date, and how many days it has been since the universe was born. That's a very big spreadsheet with lots of data in it. If you plot it, it creates a seemingly infinite diagonal line.

But it can be "abstracted" as Y=X. And that's what ML does.
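
A toy version of that spreadsheet, assuming nothing more than NumPy: the whole table collapses into two fitted parameters, which is the sense of "abstraction" meant here.

    import numpy as np

    days = np.arange(10_000, dtype=float)      # stand-in for the dates column
    elapsed = days.copy()                      # days since the universe was born

    # Fitting a straight line "abstracts" every row of the table into two numbers.
    slope, intercept = np.polyfit(days, elapsed, 1)
    print(round(slope, 6), round(intercept, 6))    # ~1.0 and ~0.0, i.e. Y = X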

That's literally what generalization is.

I don't think it's the same thing because an abstraction is still tangible. For example, "rectangle" is an abstraction for all sorts of actual rectangular shapes you can find in practice. We have a way to define what a rectangle is and to identify one.

A neural network doesn't have any actual conceptual backing for what it is doing. It's pure math. There are no abstracted properties beyond the fact that by coincidence the weights make a curve fit certain points of data.

If there was truly a conceptual backing for these "abstractions" then multiple models trained on the same data should have very similar weights as there aren't multiple ways to define the same concepts, but I doubt that this happens in practice. Instead the weights are just randomly adjusted until they fit the points of data without any respect given to whether there is any sort of cohesion. It's just math.

That's like saying multiple programs compiled by different compilers from the same sources should have very similar binaries. You're looking in the wrong place! Similarities are to be expected in the structure of the latent space, not in model weights.

For sure! Measuring parameters given data is central to statistics. It's a way to concentrate information for practical use. Sufficient statistics are very interesting, because once computed, they provably contain as much information as the data (lossless). Love statistics, it's so cool!

> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.

Unfortunately, the reality is more boring:

https://www.litcharts.com/lit/frankenstein/chapter-18
https://www.cliffsnotes.com/literature/frankenstein/chapter-...
https://www.sparknotes.com/lit/frankenstein/sparklets/
https://www.sparknotes.com/lit/frankenstein/section9/
https://www.enotes.com/topics/frankenstein/chapter-summaries...
https://www.bookey.app/freebook/frankenstein/chapter-18/summ...
https://tcanotes.com/drama-frankenstein-ch-18-20-summary-ana...
https://quizlet.com/content/novel-frankenstein-chapter-18
https://www.studypool.com/studyGuides/Frankenstein/Chapter_S...
https://study.com/academy/lesson/frankenstein-chapter-18-sum...
https://ivypanda.com/essays/frankenstein-by-mary-shelley-ana...
https://www.shmoop.com/study-guides/frankenstein/chapter-18-...
https://carlyisfrankenstein.weebly.com/chapters-18-19.html
https://www.markedbyteachers.com/study-guides/frankenstein/c...
https://www.studymode.com/essays/Frankenstein-Summary-Chapte...
https://novelguide.com/frankenstein/summaries/chap17-18
https://www.ipl.org/essay/Frankenstein-Summary-Chapter-18-90...

I have not known an LLM to be able to summarise a book found in its training data, unless it had many summaries to plagiarise (in which case, actually having the book is unnecessary). I have no reason to believe the training process should result in "abstracting the data into more efficient forms". "Throwing away most of the training data" is an uncharitable interpretation (what they're doing is more sophisticated than that) but, I believe, a correct one.

I think you are probably right but it's hard to find an example of a piece of text that an LLM is willing to output verbatim (i.e. not subject to copyright guardrails) but also hasn't been widely studied and summarised by humans. Regardless, I think you could probably find many such examples especially if you had control of the LLM training process.

> Wouldn't that mean LLMs represent an insanely efficient form of text compression?

This is a good question worth thinking about.

The output, as defined here (I'm assuming by reading the comment thread), is a set of one value between 0 and 1 for every token the model can treat as "output". The fact that LLM tokens tend not to be words makes this somewhat difficult to work with. If there are n output tokens and the probability the model assigns to each of them is represented by a float32, then the output of the model will be one of at most (2³²)ⁿ = 2³²ⁿ values; this is an upper bound on the size of the output universe.

The input is not the training data but what you might think of as the prompt. Remember that the model answers the question "given the text xx x xxx xxxxxx x, what will the next token in that text be?" The input is the text we're asking about, here xx x xxx xxxxxx x.

The input universe is defined by what can fit in the model's context window. If it's represented in terms of the same tokens that are used as representations of output, then it is bounded above by n+1 (the same n we used to bound the size of the output universe) to the power of "the length of the context window".

Let's assume there are maybe somewhere between 10,000 and 100,000 tokens, and the context window is 32768 (2¹⁵) tokens long.

Say there are 16384 = 2^14 tokens. Then our bound on the input universe is roughly (2^14)^(2^15). And our bound on the output universe is roughly 2^[(2^5)(2^14)] = 2^(2^19).

(2^14)^(2^15) = 2^(14·2^15) < 2^(16·2^15) = 2^(2^19), and 2^(2^19) was our approximate number of possible output values, so there are more potential output values than input values and the output can represent the input losslessly.

For a bigger vocabulary with 2^17 (=131,072) tokens, this conclusion won't change. The output universe is estimated at (2^(2^5))^(2^17) = 2^(2^22); the input universe is (2^17)^(2^15) = 2^(17·2^15) < 2^(32·2^15) = 2^(2^20). This is a huge gap; we can see that in this model, more vocabulary tokens blow up the potential output much faster than they blow up the potential input.

What if we only measured probability estimates in float16s?

Then, for the small 2^14 vocabulary, we'd have roughly (2^16)^(2^14) = 2^(2^18) possible outputs, while the estimate of the input universe stays at roughly (2^14)^(2^15) = 2^(14·2^15) = 2^458752, because the fineness of probability assignment is a concern exclusive to the output. (The input has its own exclusive concern, the length of the context window.) For this small vocabulary the possible inputs now outnumber the possible outputs (2^458752 vs. 2^262144), so not every input can get a unique output. For the larger vocabulary, we'll be sure again: the estimate for output will be a (reduced!) 2^(2^21) possible values, but the estimate for input will be an unchanged 2^(2^20) possible values at most, and once again each input can definitely be represented by a unique output.
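
These bounds are easy to check with exact integer arithmetic; here is a sketch using the same made-up sizes as above (they are assumptions for the argument, not properties of any real model).

    ctx = 2**15                                    # assumed context window length

    def universes(vocab, float_bits):
        inputs = vocab**ctx                        # full-length token sequences
        outputs = (2**float_bits)**vocab           # distinct probability vectors
        return inputs, outputs

    for vocab in (2**14, 2**17):
        for bits in (32, 16):
            i, o = universes(vocab, bits)
            print(f"vocab=2^{vocab.bit_length() - 1}, float{bits}: "
                  f"outputs can cover inputs: {o >= i}")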

So the claim looks plausible on pure information-theory grounds. On the other hand, I've appealed to some assumptions that I'm not sure make sense in general.

> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.

I have some issues with the substance of this, but more to the point it characterizes the problem incorrectly. Frankenstein is part of the training data, not part of the input.
