
nomic-embed-vision-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) is aligned with nomic-embed-text-v1.5, so the two can be used together for multimodal retrieval. It also implements some more modern LLM improvements, although it doesn't solve some of the problems CLIP has.
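
To make "aligned" concrete: a text query vector and image vectors live in the same space, so retrieval is just a similarity ranking. Here is a minimal sketch in plain Python. The 3-dimensional vectors are made-up stand-ins for real model outputs, and the `rank_images` helper and file names are hypothetical, not part of any Nomic API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, image_vecs):
    """Return image ids sorted by similarity to the text query, best first."""
    scored = [(cosine(query_vec, vec), image_id) for image_id, vec in image_vecs.items()]
    scored.sort(reverse=True)
    return [image_id for _, image_id in scored]

# Toy stand-ins for embeddings from an image encoder (assumed values).
image_vecs = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Toy stand-in for the embedding of a text query from the paired text encoder.
query_vec = [0.8, 0.2, 0.1]

ranking = rank_images(query_vec, image_vecs)
print(ranking)
```

In a real setup the vectors would come from the two encoders, and a vector index would replace the linear scan once the image collection gets large.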

Given the importance to your business, it may be worth looking into finetuning a modern native multimodal model like Gemma 3 to output aligned embeddings, although model size is a concern.
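
One common way to train for aligned embeddings is a CLIP-style symmetric contrastive loss over matched text/image pairs. That is an assumption about the objective on my part, not something specific to Gemma 3. A rough sketch in plain Python follows; it assumes the inputs are already L2-normalized, the temperature value is arbitrary, and how you would pool an embedding out of the model is not covered.

```python
import math

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric contrastive loss; text_vecs[i] is paired with image_vecs[i].

    Assumes every vector is L2-normalized, so dot product equals cosine similarity.
    """
    logits = [
        [sum(t * v for t, v in zip(tv, iv)) / temperature for iv in image_vecs]
        for tv in text_vecs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # image-to-text direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Toy batch: correctly paired vectors versus the same vectors with pairs swapped.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(text_vecs, [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss(text_vecs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

The loss is low when each text vector is closest to its own image and high when the pairs are mixed up, which is what pushes the two modalities into one space during finetuning.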