But doesn't this miss the "context" that the embeddings of text tokens carry? Deep inside an LLM, a text token's embedding is a compressed representation of all the tokens that precede it in the context, whereas the image embeddings are just representations of pixel values.
That puts them sort of at the level of word2vec, where the representation of "flies" in "fruit flies like a banana" vs. "time flies like an arrow" would be the same.
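You can see the static-vs-contextual distinction directly. A minimal sketch (assuming the Hugging Face `transformers` package and GPT-2 weights; the helper `vectors_for` is just for illustration): compare the context-free input embedding of the " flies" token with the hidden state at the same position after the attention layers.

```python
# Minimal sketch: static (word2vec-like) vs. contextual token representations.
# Assumes `pip install transformers torch` and GPT-2 small.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def vectors_for(sentence, word=" flies"):
    enc = tok(sentence, return_tensors="pt")
    # Locate the " flies" token in the sentence.
    pos = enc.input_ids[0].tolist().index(tok.encode(word)[0])
    with torch.no_grad():
        # Context-free table lookup, same vector wherever the token appears.
        static = model.get_input_embeddings()(enc.input_ids)[0, pos]
        # Hidden state at the same position after the attention blocks,
        # which has mixed in the preceding tokens.
        contextual = model(**enc).last_hidden_state[0, pos]
    return static, contextual

s1, c1 = vectors_for("fruit flies like a banana")
s2, c2 = vectors_for("time flies like an arrow")

cos = torch.nn.functional.cosine_similarity
print(cos(s1, s2, dim=0).item())  # 1.0: input embeddings are identical
print(cos(c1, c2, dim=0).item())  # < 1.0: hidden states diverge with context
```

The input-embedding similarity comes out at exactly 1.0, like word2vec, while the post-attention similarity drops once the surrounding words get mixed in.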