What about a model designed for robotics and vision? Seems like an LLM trained on text would inherently not be great for this.

DeepMind's other models, however, might do better?