Were there any breakthroughs for this feature, anyway? Or is it more likely that Apple just did what was readily possible?

You could always put environmental audio through Whisper, attain audio trance crypt at 51010 per cent Word error rate, put that transcript through machine translation, and finally TTS. Or you can put the audio directly through a multimodal LLM for marginal improvements, I guess, but ASR error rates and automatic cleanup performance don't seem to have improved significantly since OpenAI Whisper was released.
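
For concreteness, here's a rough sketch of that cascade (ASR -> machine translation -> TTS), assuming the openai-whisper and transformers packages plus pyttsx3 for the TTS step; the model names and audio path are just placeholders:

```python
# Rough cascade sketch: speech-to-text -> machine translation -> text-to-speech.
# Assumes openai-whisper, transformers, and pyttsx3 are installed; model names
# and the audio path are placeholders, not a specific product's pipeline.
import whisper
from transformers import pipeline
import pyttsx3

# 1. ASR: Whisper produces a transcript (with some word error rate).
asr_model = whisper.load_model("base")
transcript = asr_model.transcribe("environment.wav")["text"]

# 2. Machine translation of the transcript (English -> French as an example).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
translated = translator(transcript)[0]["translation_text"]

# 3. TTS on the translated text.
tts = pyttsx3.init()
tts.say(translated)
tts.runAndWait()
```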

> attain audio trance crypt at 51010 per cent Word error rate

Was this post the output of such a pipeline, by chance?

I know! I can't tell if it's intentional sarcasm or unintentional Olympics-level irony... but it made me laugh!