I was thinking along the same lines when I wrote my 5,000-word analysis of the current state of VLAMs. https://jdsemrau.substack.com/p/visual-language-action-model...