> Wouldn't that mean LLMs represent an insanely efficient form of text compression?
This is a good question worth thinking about.
The output, as defined here (I'm inferring this from the comment thread), is one value between 0 and 1 for every token the model can treat as "output". The fact that LLM tokens tend not to be words makes this somewhat awkward to work with. If there are n output tokens and the probability the model assigns to each of them is represented by a float32, then the output of the model will be one of at most (2³²)ⁿ = 2³²ⁿ values; this is an upper bound on the size of the output universe.
The input is not the training data but what you might think of as the prompt. Remember that the model answers the question "given the text xx x xxx xxxxxx x, what will the next token in that text be?" The input is the text we're asking about, here xx x xxx xxxxxx x.
The input universe is defined by what can fit in the model's context window. If the input is represented using the same tokens used for the output, then its size is bounded above by (n+1) to the power of "the length of the context window", where n is the same n we used to bound the output universe and the +1 allows for inputs shorter than the full window.
Let's assume the vocabulary has somewhere between 10,000 and 100,000 tokens, and that the context window is 32768 (2¹⁵) tokens long.
Say there are n = 16384 = 2^14 tokens. Then our bound on the input universe is roughly (2^14)^(2^15), and our bound on the output universe is roughly (2^32)^(2^14) = 2^[(2^5)(2^14)] = 2^(2^19).
(2^14)^(2^15) = 2^(14·2^15) < 2^(16·2^15) = 2^(2^19), and 2^(2^19) was our approximate number of possible output values, so there are more potential output values than input values and the output can represent the input losslessly.
For a bigger vocabulary with 2^17 (=131,072) tokens, this conclusion won't change. The output universe is estimated at (2^(2^5))^(2^17) = 2^(2^22); the input universe is (2^17)^(2^15) = 2^(17·2^15) < 2^(32·2^15) = 2^(2^20). This is a huge gap; we can see that in this model, more vocabulary tokens blow up the potential output much faster than they blow up the potential input.
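The float32 arithmetic above is easy to check numerically. Here's a quick sketch; the vocabulary sizes, 32-bit probabilities, and 2^15-token context are the assumptions of this comment, not facts about any particular model:

```python
from math import log2

CONTEXT = 2**15  # assumed context window: 32768 tokens
PROB_BITS = 32   # float32 probability per vocabulary token

for vocab_bits in (14, 17):
    n = 2**vocab_bits
    # log2 of the output universe: (2**PROB_BITS) ** n = 2 ** (PROB_BITS * n)
    out_exp = PROB_BITS * n
    # log2 of the input universe: (n + 1) ** CONTEXT = 2 ** (CONTEXT * log2(n + 1))
    in_exp = CONTEXT * log2(n + 1)
    print(f"vocab 2^{vocab_bits}: output 2^{out_exp}, input ~2^{in_exp:.0f}, "
          f"output larger: {out_exp > in_exp}")
```

Both cases report "output larger: True", matching the conclusion that at float32 precision the output universe is big enough to represent every input uniquely.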
What if we only measured probability estimates in float16s?
Then, for the small 2^14 vocabulary, we'd have roughly (2^16)^(2^14) = 2^(2^18) possible outputs. Our estimate of the input universe would remain unchanged at "less than 2^(2^19)", because the fineness of probability assignment is a concern exclusive to the output. (The input has its own exclusive concern, the length of the context window.) For this small vocabulary, we're no longer sure that every possible input can have a unique output. For the larger vocabulary, we are sure again: the estimate for the output shrinks to 2^(2^21) possible values, while the estimate for the input stays at 2^(2^20) possible values, so once again each input can definitely be represented by a unique output.
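The float16 comparison can be sketched the same way, using the same loose input bound as above (the vocabulary's bit-length rounded up to the next power of two; again, all parameters are this comment's assumptions):

```python
PROB_BITS = 16   # float16 probability per vocabulary token
CONTEXT = 2**15  # assumed context window length

# (vocab_bits, vocab_bits rounded up to a power of two for the loose input bound)
for vocab_bits, rounded_bits in ((14, 16), (17, 32)):
    out_exp = PROB_BITS * 2**vocab_bits    # log2 of the output universe
    in_bound_exp = rounded_bits * CONTEXT  # log2 of the loose input upper bound
    if out_exp > in_bound_exp:
        verdict = "output exceeds input bound: lossless representation possible"
    else:
        verdict = "output below input bound: inconclusive"
    print(f"vocab 2^{vocab_bits}: output 2^{out_exp} vs "
          f"input bound 2^{in_bound_exp}; {verdict}")
```

This reproduces the asymmetry above: the small vocabulary lands at 2^(2^18) output vs a 2^(2^19) input bound (inconclusive), while the large vocabulary lands at 2^(2^21) output vs a 2^(2^20) input bound (lossless representation still possible).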
So the claim looks plausible on pure information-theory grounds. On the other hand, I've appealed to some assumptions that I'm not sure make sense in general.
> That's why an LLM (I tested this on Grok) can give you a summary of chapter 18 of Mary Shelley's Frankenstein, but cannot reproduce a paragraph from the same text verbatim.
I have some issues with the substance of this, but more to the point, it mischaracterizes the problem: Frankenstein is part of the training data, not part of the input.