From what I understand, it is the STT engine that is the issue - and it is in fact not a solved problem at all. Specifically, in a conversation where the microphones hear 3 people, 1 of them talking _at_ us, we need to pick out _that_ person only to translate.
If we were using Whisper in that pipeline, we could, for example, generate speaker embeddings for each segment Whisper produces and then group segments by matching embeddings - but in practice, this doesn't work all that well.
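A rough sketch of that embed-and-cluster idea might look like the following (assuming openai-whisper plus SpeechBrain's ECAPA-TDNN speaker encoder; the model names, clustering threshold, and input file are illustrative, not tuned):

```python
# Sketch: transcribe with Whisper, embed each segment with a speaker encoder,
# then cluster the embeddings to guess "who said what". Not production code.
import torch
import torchaudio
import whisper
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference on newer versions
from sklearn.cluster import AgglomerativeClustering

AUDIO = "conversation.wav"  # hypothetical input file

# 1. Whisper gives timestamped segments, but no speaker labels.
stt = whisper.load_model("base")
segments = stt.transcribe(AUDIO)["segments"]

# 2. Embed each segment with a speaker-verification encoder (ECAPA-TDNN here).
wav, sr = torchaudio.load(AUDIO)
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

embeddings, kept = [], []
with torch.no_grad():
    for seg in segments:
        chunk = wav[int(seg["start"] * 16000): int(seg["end"] * 16000)]
        if chunk.numel() < 16000 // 2:   # skip very short segments; they embed poorly
            continue
        emb = encoder.encode_batch(chunk.unsqueeze(0)).squeeze()
        embeddings.append(emb / emb.norm())  # L2-normalize for cosine distance
        kept.append(seg)

# 3. Group segments by speaker via agglomerative clustering on the embeddings.
#    The distance threshold is a guess and needs per-domain tuning; short or
#    overlapping segments produce noisy embeddings, which is a big part of why
#    this approach underwhelms in real conversations.
labels = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average", distance_threshold=0.6
).fit_predict(torch.stack(embeddings).numpy())  # use affinity= on older scikit-learn

for seg, spk in zip(kept, labels):
    print(f"[speaker {spk}] {seg['text'].strip()}")
```

Even with a decent encoder, the clusters tend to blur together in exactly the overlapping, multi-speaker situations where you need them most.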
But we are still left with the question of _WHO_ to feed to the translation model - ideally the person facing us or talking at us - so we'd have to classify the 3 people, all talking to each other, by their angle relative to the listener's head, etc. This is what the diarization model would have to do - and while the more sophisticated diarization models certainly could use a precise angle as input, that angle can only be computed if you have super-tight timings.
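To make the timing point concrete, here's a minimal sketch of how an angle estimate typically falls out of the time difference of arrival between two microphones (GCC-PHAT is a common choice; the mic spacing, sample rate, and function names here are my assumptions, not anything from a specific device):

```python
# Sketch: estimate direction of arrival from a 2-mic array via GCC-PHAT.
# Assumes sample-accurate, synchronized channels; spacing is made up.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.08       # metres between the two microphones (assumed)
SAMPLE_RATE = 16000

def gcc_phat(sig, ref):
    """Generalized cross-correlation with phase transform; returns delay in samples."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return np.argmax(np.abs(cc)) - max_shift

def angle_of_arrival(left, right):
    """Estimate the source angle in degrees (0 = broadside) from two mic channels."""
    delay_samples = gcc_phat(left, right)
    tdoa = delay_samples / SAMPLE_RATE       # seconds
    # At 16 kHz, one sample is ~62.5 microseconds, i.e. ~2 cm of path difference,
    # so single-sample timing errors already move the estimate noticeably.
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

On an 8 cm baseline the whole usable delay range is only a few samples at 16 kHz, which is why the angle input is only meaningful if the timestamps are essentially sample-accurate.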