This is a big reason why SOTA models are trained multimodal these days. Even when you're using them for text, the knowledge they gain from images and video improves their world models.
This is a big reason why SOTA models are trained multimodal these days. Even when you're using them for text, the knowledge they gain from images and video improves their world models.