Vision and audio are already in use in multimodal LLMs. So it has already been possible.

Who said anything about vision and audio?