I'm not sure who scuttlebutt is, but in an architecture like:

audio goes into mic => STT engine => translation model => TTS engine => audio comes out of speaker

a change in hardware would be a change in the "audio goes into mic" component, which is not the critical part of the architecture.

All the parts of the above architecture already exist: we already have mics, STT, translation models, TTS, and speakers, and they all worked in other systems before apple even announced this, much less came up with a redesign. Most likely the redesign is aesthetic, or just gives slightly better sound transmission or reception – none of which was necessary for the above architecture to function in other, non-apple systems.
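
Just to make that concrete, here is roughly what gluing off-the-shelf pieces together looks like. This is a sketch, not a claim about what apple is doing: I'm assuming the sounddevice, openai-whisper, argostranslate and pyttsx3 packages purely because they're freely available; any other STT/translation/TTS components would slot in the same way.

```python
# Sketch of the mic => STT => translation => TTS => speaker pipeline.
# All component choices here are assumptions; equivalents exist for each stage.
import sounddevice as sd
from scipy.io import wavfile
import whisper
import argostranslate.translate
import pyttsx3

SAMPLE_RATE = 16_000

def capture(seconds: float, path: str = "utterance.wav") -> str:
    """Record from the default mic and dump to a wav file."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    wavfile.write(path, SAMPLE_RATE, audio)
    return path

def translate_speech(path: str, src: str = "es", dst: str = "en") -> str:
    """STT, then text-to-text translation (the argos language package for
    src -> dst must already be installed)."""
    text = whisper.load_model("base").transcribe(path, language=src)["text"]
    return argostranslate.translate.translate(text, src, dst)

def speak(text: str) -> None:
    """Play the translated text through the default speaker."""
    tts = pyttsx3.init()
    tts.say(text)
    tts.runAndWait()

if __name__ == "__main__":
    speak(translate_speech(capture(5.0)))
```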

I am, of course, assuming apple's architecture is a rough approximation of the above. An alternative theoretical architecture might resemble the one below, but I have seen no evidence apple is doing this.

audio goes into mic => direct audio-to-audio translation model => audio comes out of speaker

From what I understand, it is the STT engine that is the issue - and it is in fact not a solved problem at all. Specifically, in a conversation where the microphones hear 3 people talking, 1 of them talking _at_ us, we need to pick out only _that_ person to translate.

If we were using Whisper in that pipeline, we could, for example, generate a speaker embedding for each segment Whisper produces, then group segments by matching the speaker embeddings - and in reality, this doesn't work all that well.
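
Something along these lines - a sketch, assuming resemblyzer for the speaker embeddings and scikit-learn for the grouping; the clustering threshold is made up, and that grouping step is exactly what gets flaky on real overlapping, far-field speech:

```python
# Embed each Whisper segment with a speaker encoder, then cluster the
# embeddings so segments from the same voice end up in the same group.
# Assumes openai-whisper, resemblyzer, librosa and scikit-learn (>= 1.2).
import numpy as np
import librosa
import whisper
from resemblyzer import VoiceEncoder
from sklearn.cluster import AgglomerativeClustering

SAMPLE_RATE = 16_000  # resemblyzer's encoder expects 16 kHz audio

def diarize(path: str):
    segments = whisper.load_model("base").transcribe(path)["segments"]
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)  # keep timing aligned with Whisper
    encoder = VoiceEncoder()

    # One d-vector per Whisper segment (very short segments embed poorly).
    embeddings = np.stack([
        encoder.embed_utterance(
            wav[int(s["start"] * SAMPLE_RATE):int(s["end"] * SAMPLE_RATE)]
        )
        for s in segments
    ])

    # Group segments whose embeddings are close in cosine distance.
    # The 0.6 threshold is an arbitrary guess and needs tuning per setup.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.6,
        metric="cosine", linkage="average",
    ).fit_predict(embeddings)

    return [(int(label), s["text"]) for label, s in zip(labels, segments)]
```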

But we are still left with the question of _WHO_ to feed to the translation model - ideally, the person facing us or talking at us - so we'd have to classify the 3 people, all talking to each other, by their angle relative to the listener's head, etc. This is what the diarization model would have to do - and the more sophisticated diarization models certainly could use precise angle input, which can only be computed if you have super-close timings across the mics.
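
To illustrate why the timings have to be that tight: with two mics a fixed distance apart, the bearing of a talker falls out of a sub-millisecond difference in arrival time. A toy GCC-PHAT sketch follows - the mic spacing, sample rate and so on are all assumptions, and a real earbud array has more mics plus head tracking on top of this:

```python
# Estimate a talker's bearing from the time-difference-of-arrival between
# two mics. At 48 kHz and ~15 cm spacing the usable delay range is only
# about +/-0.44 ms, i.e. ~21 samples - hence "super-close timings".
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.15       # m, roughly ear to ear - an assumption
SAMPLE_RATE = 48_000     # Hz

def gcc_phat(left: np.ndarray, right: np.ndarray) -> float:
    """Time delay (seconds) of `right` relative to `left` via GCC-PHAT."""
    n = len(left) + len(right)
    spectrum = np.fft.rfft(left, n) * np.conj(np.fft.rfft(right, n))
    cross_corr = np.fft.irfft(spectrum / (np.abs(spectrum) + 1e-12), n)
    max_shift = int(SAMPLE_RATE * MIC_SPACING / SPEED_OF_SOUND)
    cross_corr = np.concatenate((cross_corr[-max_shift:], cross_corr[:max_shift + 1]))
    return (np.argmax(np.abs(cross_corr)) - max_shift) / SAMPLE_RATE

def angle_of_arrival(left: np.ndarray, right: np.ndarray) -> float:
    """Bearing in degrees: 0 = straight ahead, +/-90 = fully to one side."""
    delay = gcc_phat(left, right)
    # delay = MIC_SPACING * sin(theta) / SPEED_OF_SOUND; clip for numerical safety
    return float(np.degrees(np.arcsin(
        np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    )))
```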