For now we're using CLIP. We've also tested SigLIP and Gemma for a full-blown vision model.