Because they’re also multimodal vLLMs.