In general, text isn’t a great medium for transmitting spatial info. That’s why it’s easy for a model to understand an image but hard for it to draw an SVG of that image.
This is a big reason why SOTA models are trained multimodal these days. Even when you're using them for text, the knowledge they gain from images and video improves their world models.
In general, text isn’t a great medium for transmitting spatial info. That’s why it’s easy for a model to understand an image but hard for it to draw an SVG of that image.
This is a big reason why SOTA models are trained multimodal these days. Even when you're using them for text, the knowledge they gain from images and video improves their world models.