> text tokens are discrete while image tokens are continuous. Each model has a finite number of text tokens - say, around 50,000. Each of those tokens corresponds to an embedding of, say, 1000 floating-point numbers. Text tokens thus only occupy a scattering of single points in the space of all possible embeddings. By contrast, the embedding of an image token can be any sequence of those 1000 numbers. So an image token can be far more expressive than a series of text tokens.
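To make the contrast concrete, here is a minimal PyTorch sketch (the dimensions follow the "say, 50,000 tokens" and "say, 1000 floats" in the quote; the patch size and layer choices are just illustrative assumptions): a text token is a lookup into a finite table of fixed points, while an image patch is projected by a continuous map and can land essentially anywhere in the same space.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: ~50,000 text tokens, 1000-dim embeddings.
vocab_size, d_model = 50_000, 1000

# Text: a discrete lookup table -- every token id maps to one of
# exactly vocab_size fixed points in the 1000-dimensional space.
text_embed = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([101, 2057, 42])      # three text tokens
text_vectors = text_embed(token_ids)           # shape (3, 1000)

# Image: a continuous projection -- a 16x16 RGB patch (assumed size)
# goes through a linear layer, so the resulting vector can be almost
# any point in the same 1000-dimensional space.
patch_dim = 16 * 16 * 3
patch_embed = nn.Linear(patch_dim, d_model)
patch = torch.rand(1, patch_dim)               # one image patch
image_vector = patch_embed(patch)              # shape (1, 1000)
```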

This is an interesting point. But if this were true, couldn't we just put a convolutional NN on the text tokens below the transformer stack to reduce the token count by a similar factor of 10? A sketch of what that might look like follows.
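For concreteness, here is a minimal PyTorch sketch of that idea (the factor of 10, the kernel size, and the layer names are hypothetical, not something any existing model is claimed to do): a strided 1-D convolution merges every 10 neighbouring token embeddings into one continuous vector before the transformer sees them.

```python
import torch
import torch.nn as nn

# Hypothetical "text downsampler": a strided Conv1d over the token
# embeddings that compresses the sequence length by a factor of 10.
d_model, factor = 1000, 10
downsample = nn.Conv1d(d_model, d_model, kernel_size=factor, stride=factor)

token_embeddings = torch.rand(1, 500, d_model)   # 500 token embeddings
x = token_embeddings.transpose(1, 2)             # Conv1d expects (batch, channels, length)
compressed = downsample(x).transpose(1, 2)       # shape (1, 50, d_model)
# The transformer would then attend over 50 continuous "super-tokens"
# instead of 500 discrete ones.
```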