The CALM paper https://shaochenze.github.io/blog/2025/CALM/ says it is possible to compress 4 tokens into a single embedding, so... image = 4 × 256 = 1024 words > 1000 words. QED
2.4% relative error is not bad.
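(Assuming the error is measured against the proverbial 1000 words: (1024 − 1000) / 1000 = 0.024, i.e. 2.4%.)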
Reminds me of Babbage making allowance for meter.
"""
"""Shouldn't it be the other way around if the population is increasing? Every minute one is born = 1440 born/day, every minute and a sixteenth ~= 1335 dead/day for a net population increase of 105/day.
It means that in every minute, one and a sixteenth of a man is born.
Wouldn't "one and a sixth" be more accurate in both respects?
How do you decompress all those 4 words from one token?
Not from one token, but from one embedding. Text has a low information density: it is possible to compress a few token embeddings into a single token embedding.
The how varies. The CALM paper seems to use an MLP to compress an N×D input (N embeddings of size D) into a single D-dimensional embedding, and another MLP to decompress it back; see the sketch below.
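Roughly something like this, as a minimal PyTorch sketch of that idea; it is not the paper's actual code, and the layer sizes, depth, and MSE objective are all assumptions:

```python
# Minimal sketch of the compress/decompress idea described above.
# NOT the CALM paper's actual code: layer sizes, depths, and the MSE
# objective are illustrative assumptions.
import torch
import torch.nn as nn

N, D = 4, 256  # N token embeddings of size D (made-up sizes)

compress = nn.Sequential(            # (N*D) -> D
    nn.Linear(N * D, 4 * D), nn.GELU(), nn.Linear(4 * D, D)
)
decompress = nn.Sequential(          # D -> (N*D)
    nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, N * D)
)

tokens = torch.randn(1, N, D)                  # 4 token embeddings
z = compress(tokens.flatten(1))                # one D-dimensional embedding
recon = decompress(z).view(1, N, D)            # back to 4 embeddings

loss = nn.functional.mse_loss(recon, tokens)   # train both MLPs end to end
loss.backward()
print(z.shape, recon.shape, loss.item())
```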
The mechanism would be prediction (learnt during training), not decompression.
It's the same as LLMs being able to "decode" Base64, or work with sub-word tokens for that matter: the model just learns to predict that
<compressed representation> will be followed by (or preceded by) <decompressed representation>.
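To make that concrete, here is a hedged toy sketch (not the CALM paper's actual objective; vocab and layer sizes are made up) where the "decompressor" is simply trained to predict the original token ids from the compressed embedding:

```python
# Hedged sketch of "decompression as prediction": instead of exactly
# inverting the compressed embedding, a decoder head is trained to
# *predict* the original token ids from it with plain cross-entropy.
# All sizes below are made up.
import torch
import torch.nn as nn

V, D, N = 1000, 256, 4            # toy vocab size, embedding size, tokens per chunk

decode = nn.Sequential(           # D -> N*V logits, one distribution per position
    nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, N * V)
)

z = torch.randn(8, D)                        # batch of compressed embeddings
target_ids = torch.randint(0, V, (8, N))     # the token ids that were compressed

logits = decode(z).view(8, N, V)
loss = nn.functional.cross_entropy(logits.reshape(-1, V), target_ids.reshape(-1))
loss.backward()                              # learned prediction, no exact inverse
print(loss.item())
```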