Vision and audio are already in use in multimodal LLMs, so this has already been possible for a while.
Who said anything about vision and audio?