> Today the app still uses CLIP for embeddings
Have you investigated multimodal embeddings from other models? CLIP is out of date, to put it mildly.
For sure. We've been prototyping embedding a vision model in Desktop Docs, but it needs more dev time to be stable. We went with CLIP for parity, but we're looking to upgrade soon. I tried SigLIP and wasn't impressed. Do you know of other open-source image embedding models you'd recommend?
nomic-embed-vision-v1.5 (https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5) is alignable with nomic-embed-text-v1.5 for multimodal retrieval and incorporates some of the more modern improvements from LLM research, although it doesn't solve some of the problems CLIP has.
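Rough sketch of what that pairing looks like for text-to-image retrieval, based on my memory of the two model cards (pooling recipe and the "search_query: " prefix are assumptions worth double-checking before relying on them):

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

# --- Image side: nomic-embed-vision-v1.5 ---
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True
).eval()

image = Image.open("frame_0001.jpg")  # any local image file
with torch.no_grad():
    pixel_values = processor(image, return_tensors="pt")
    img_out = vision_model(**pixel_values).last_hidden_state
    # CLS-token pooling + L2 normalization (check the model card for the exact recipe)
    img_emb = F.normalize(img_out[:, 0], p=2, dim=1)

# --- Text side: nomic-embed-text-v1.5, trained to share the same space ---
tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")
text_model = AutoModel.from_pretrained(
    "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
).eval()

# nomic-embed-text expects a task prefix; "search_query: " is the retrieval-query one
query = "search_query: a golden retriever catching a frisbee"
with torch.no_grad():
    tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
    txt_out = text_model(**tokens).last_hidden_state
    # Mean-pool over tokens (masking padding), then L2-normalize
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    txt_emb = (txt_out * mask).sum(1) / mask.sum(1)
    txt_emb = F.normalize(txt_emb, p=2, dim=1)

# Cosine similarity between the query and the image; rank your library by this score
score = (txt_emb @ img_emb.T).item()
print(f"query-image similarity: {score:.3f}")
```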
Given the importance to your business, it may be worth finetuning a modern native multimodal model like Gemma 3 to output aligned embeddings, although model size is a concern.
I love Gemma. I use it in LM Studio and am working on getting it into Desktop Docs. Thanks for the nomic link. I'll do some testing...