Hacker News

Not from one token, from one embedding. Text contains a low amount of information: it is possible to compress a few token embeddings into a single tiken embedding.

The how is variable. The calm paper seems to have used a MLP to compress from and ND input (N embeddings of size D) into a single D embedding and other for decompress them back