Most frontier models are multimodal and can take audio or video files as input natively.

I'm experimenting right now with an English-to-Thai subtitle translator that feeds in the existing English subtitles along with a mono (centre-weighted) audio track extracted using ffmpeg. The audio is needed because Thai has gendered particles: word choice depends on the sex of the speaker, which is not recorded in the English text. The models can infer this to a degree from the text alone, but they do better when given audio, since they can then do speaker diarization.
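
A rough sketch of what the audio extraction step might look like, not necessarily the exact command used here. It assumes a 5.1 source with a front-centre (FC) channel, where dialogue usually sits; the function name, the 0.6/0.2/0.2 weights, the 16 kHz sample rate and the file names are all illustrative guesses. For a plain stereo source, swapping the `pan` filter for `-ac 1` would be the simpler route.

```python
import subprocess


def build_ffmpeg_cmd(video: str, out_wav: str,
                     centre: float = 0.6, rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that downmixes to centre-weighted mono.

    Assumption: the input has FC, FL and FR channels (e.g. a 5.1 track).
    The weights are hypothetical, not values taken from the original post.
    """
    side = (1.0 - centre) / 2  # split the remaining weight between left and right
    pan = f"pan=mono|c0={centre:.2f}*FC+{side:.2f}*FL+{side:.2f}*FR"
    return [
        "ffmpeg", "-y",      # overwrite the output file if it exists
        "-i", video,
        "-vn",               # drop the video stream, keep audio only
        "-af", pan,          # weighted downmix to a single channel
        "-ar", str(rate),    # resample; 16 kHz is an assumed speech-friendly rate
        out_wav,
    ]


if __name__ == "__main__":
    # Hypothetical file names; requires ffmpeg on PATH.
    cmd = build_ffmpeg_cmd("episode.mkv", "episode_mono.wav")
    subprocess.run(cmd, check=True)
```

The resulting WAV file would then be sent to the model together with the English subtitle text.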