LLMs are trained on text. Why would we expect them to understand a visual and tactile 3D world?

Because they’re also multimodal vLLMs.