What vision/LLM model do you use?

For now we're using CLIP. We've also done testing with SigLIP and Gemma for a full-blown vision model.