Hey Luke, our model does exceptionally well on text and images, particularly when the two are mixed together. A good example is e-commerce, where you may have a product title, a description, and multiple images of the product. When you combine those into a single payload using our inputs parameter, we find the model responds really well to additional images (i.e., retrieval quality goes up as you add 1, 2, 3, ..., N images). As you pointed out with Google's multimodal model, most jointly trained multimodal embedding models suffer in the text modality. Amazon used to have a multimodal embedding model, which also took in only a very small text payload. We're thinking about audio and video as well, but nothing for Q2 at least.
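
On the single-payload point, here's a rough sketch of the idea in Python. Only the top-level `inputs` name comes from what I described above. Everything else (the `content` list, the `type`, `text`, and `image_url` keys, the `build_product_input` helper, and the example URLs) is made up for illustration, so check the API docs for the real schema before using it.

```python
from typing import Any


def build_product_input(
    title: str, description: str, image_urls: list[str]
) -> dict[str, Any]:
    """Bundle one product's text and images into a single mixed input.

    NOTE: the item schema below is hypothetical, not the documented API.
    """
    # Text goes first: title and description joined into one text item.
    content: list[dict[str, str]] = [
        {"type": "text", "text": f"{title}\n{description}"}
    ]
    # Then one item per product image, in the order given.
    content += [{"type": "image_url", "image_url": url} for url in image_urls]
    return {"content": content}


# One product = one entry in `inputs`, so text and images are embedded together.
payload = {
    "inputs": [
        build_product_input(
            "Trail running shoe",
            "Lightweight, waterproof upper.",
            [
                "https://example.com/shoe-front.jpg",
                "https://example.com/shoe-side.jpg",
            ],
        )
    ]
}
```

Adding more images for the same product just means passing a longer `image_urls` list. The product still maps to a single entry in `inputs`.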