Hacker News

> LLMs are awful at the spatial stuff

Could someone please elaborate on this? This is intriguing.

In general, text isn’t a great medium for transmitting spatial info. That’s why it’s easy for a model to understand an image but hard for it to draw an SVG of that image: in an SVG, every shape and position has to be spelled out as text coordinates.

int_19h 2 months ago

This is a big reason why SOTA models are trained multimodal these days. Even when you're using them for text, the knowledge they gain from images and video improves their world models.