LLMs were text-only not that long ago.

Multimodality is new; you won’t have to wait long before they can do what you’re describing.