Reminds me of the difference between fastText and word2vec.
fastText can handle words it hasn't seen before by composing their vectors from character n-grams; word2vec learns a better representation of whole words, but misses out on unknown (out-of-vocabulary) words.
Image tokens are the "text2vec" here, embedding a whole chunk of text directly, while text tokens are the proxy for building an embedding of even previously unseen text.
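To make the fastText/word2vec contrast concrete, here's a minimal sketch with gensim (the toy corpus and the OOV word are made up purely for illustration):

```python
# Minimal sketch with gensim; corpus and words are illustrative only.
from gensim.models import Word2Vec, FastText

# Tiny toy corpus of tokenized sentences.
sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "brown", "dog"],
]

w2v = Word2Vec(sentences, vector_size=16, min_count=1, seed=1)
ft = FastText(sentences, vector_size=16, min_count=1, seed=1)

# word2vec only has vectors for whole words seen during training.
try:
    w2v.wv["foxes"]  # out-of-vocabulary -> KeyError
except KeyError:
    print("word2vec: no vector for 'foxes'")

# fastText composes a vector from character n-grams,
# so it can still produce one for an unseen word.
print("fasttext:", ft.wv["foxes"][:4])
```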