It’s not that simple - when implementing this, one runs into the issue that detecting self speech is a solved problem - BUT detecting the speech of a person talking AT you in a restaurant is not nearly that easy - this is known as diarization. This needs custom models - and I am willing to bet the model for the iPhone is tuned specifically to the AirPods. How would they even provide that? And I’d bet that the custom microphones in the AirPods provide a much better time-synced stream to the phone than just a random pair of headphones - I’d be willing to bet this is not just Bluetooth, but also out-of-band clock drift correction, etc. Which allows for much better phase data - which makes training diarization models simpler - and makes them accurate. So - I’d bet there is a per-headset model here - and one that probably requires more than just audio.
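To make the timing point concrete: the textbook way to pull that phase/timing information out of two mics is GCC-PHAT, which estimates the time difference of arrival (TDOA) between the channels. The sketch below is just that classic technique with an assumed 0.15 m ear-to-ear spacing - an illustration of why clock sync between the earbuds matters, not anything we actually know about Apple's pipeline.

```python
# Illustrative only - classic GCC-PHAT TDOA estimation between two mic
# channels, plus the arithmetic for why clock drift between earbuds hurts.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimated delay (seconds) of `sig` relative to `ref`, via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-15                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

fs = 48_000
d, c = 0.15, 343.0     # assumed mic spacing (m), speed of sound (m/s)

# Toy check: the same noise burst arrives 10 samples (~208 us) later at one mic.
rng = np.random.default_rng(0)
ref = rng.standard_normal(fs // 10)
sig = np.concatenate((np.zeros(10), ref))[: ref.size]
print("estimated TDOA: %.0f us" % (gcc_phat(sig, ref, fs, max_tau=d / c) * 1e6))

# The largest physically possible TDOA at 0.15 m spacing is ~437 us, so even
# tens of microseconds of clock drift between two Bluetooth earbuds is a big
# chunk of the quantity you are trying to measure.
print("max possible TDOA: %.0f us" % (d / c * 1e6))
```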
The issue the EU has is much simpler than this. They are not requesting that Apple provide a model that works for their competitors' headphones; they are requesting that Apple also allow competitors to run their own models the same way Apple allows the AirPods to.
If Apple did this, and it turns out Apple did tune the model to the extent that it works really well with AirPods and really poorly with any other competitor, would there be any possibility of legal action against Apple (by the competitor or by the EU) on the grounds that the model's training is effectively limiting the functionality?
For example, maybe Apple purposefully trained the model not only to optimize for working with AirPods, but to optimize for _not_ working with any other input devices? If Apple could be in trouble if it became known they did such specific training, could it also mean they face potential legal issues merely because of the possibility that they did?
(I think we can also make a similar argument when it comes to the hardware being made to optimally run one specific model and possible allegations it doesn't run other models as well.)
If the government doesn't crack down on this sort of behavior, a company could use it to try to meet the legal requirements technically while also still hurting competitors. But if the government does crack down on it, then a company could be caught up under an accusation of doing this even if they didn't (they just didn't care to optimize for competitors at all, but never negatively optimized for them either).
Quite a burden to provide an ecosystem. I mean doesn't this extend to anything you want it to extend to? From AirDrop to the complete feature set of the AirTag and FindMy ecosystem. Your non Apple airtag has to show up in FindMy or at least be capable of being added? You have ultra-wideband features for AirTag. You need to make that available too?
If I were Apple, I'd say you got what you want EU, it works on ALL earphones in EU. But it will be absolutely terribly shitty because we will use the same model trained for our AirPods on your random headphones.
You're using a third party BLE airtag and clicking on UWB? Enjoy tracking this approximate noisy location that we're basing off of some noise pattern we didn't lock on.
Feature provided, just not well. Goes against Apple's ethos of trying to make things polished but don't let some bureaucrats weaponize that against you.
Nobody is forcing Apple, the gatekeeper to the iPhone and iOS ecosystem, to also make headphones and compete in that totally separate market, but they are of course free to.
The issue arises when Apple leverages their position as gatekeeper to anticompetitively preference their own headphones in the iPhone/iOS ecosystem. Can't do that.
> If I were Apple, I'd say you got what you want EU, it works on ALL earphones in EU. But it will be absolutely terribly shitty because we will use the same model trained for our AirPods on your random headphones.
The problem for Apple is that they have no secret sauce here: absent any ratfuckery, it would probably work just as well with competing headsets, if not better (particularly since many of Apple's competitors' headsets have better sound quality, better microphone quality, and better noise cancellation). That's probably why they aren't taking your suggestion and are instead choosing anticompetitive behavior.
> The problem for Apple is that they have no secret sauce here: absent any ratfuckery, it would probably work just as well with competing headsets.
Yeah, I'd believe it. There is a good chance that is very much the case here.
Let’s just say there is. Scuttlebutt says there was at least a microphone pickup redesign and a timing redesign because the diarization model's loss curve was crap - and given what I hear from the rest of the industry on auto-diarization in conference rooms, I believe that easily. Basically, the AI guys tried to get it working with the standard data they had, and the loss curve was crap no matter how much compute they threw at it. So, they had to go to the HW ppl and say ‘no bueno’ - and someone had to redesign the time sync and swap out a microphone capsule.
For reference, we are seeing it more and more - sensor design changes to improve loss curve performance - there’s even a term being bandied about : “AI-friendly sensor design”. This does have a nasty side effect of basically breaking abstraction - but that’s the price you pay for using the bitter lesson and letting the model come up with features instead of doing it yourself. (Basically - the sensor->computer abstraction eats details the RL could use to infer stuff)
I'm not sure who scuttlebutt is, but in the architecture of:
audio goes into mic => STT engine => translation model => TTS engine => audio comes out of speaker
a change in hardware would be a change in the "audio goes into mic" component, which is not the critical part of the model.
All the parts of the above architecture already exist: we already have mics, STT, translation models, TTS, and speakers, and they all worked on other systems before apple even announced this, much less came up with a redesign. Most likely the redesign is aesthetic or just has slightly better sound transmission or reception – none of those were necessary for the functioning of the above architecture in other, non-apple systems.
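For what it's worth, that cascade really is just off-the-shelf parts glued together. Here's a minimal sketch using Whisper for STT, a generic Hugging Face translation model, and pyttsx3 for TTS - these specific libraries (and the English-to-French direction and file name) are my stand-ins, not anything apple ships:

```python
# A rough stand-in for the cascade above, built from off-the-shelf pieces.
# These specific libraries are illustrative choices, not Apple's stack.
import whisper                      # pip install openai-whisper
from transformers import pipeline   # pip install transformers sentencepiece
import pyttsx3                      # pip install pyttsx3

def translate_clip(path_to_wav: str) -> None:
    # 1. "audio goes into mic" -> here, just a pre-recorded clip on disk
    stt = whisper.load_model("base")
    text = stt.transcribe(path_to_wav)["text"]

    # 2. STT output -> translation model (English to French, arbitrary example)
    translator = pipeline("translation_en_to_fr", model="t5-small")
    translated = translator(text)[0]["translation_text"]

    # 3. TTS engine -> "audio comes out of speaker"
    tts = pyttsx3.init()
    tts.say(translated)
    tts.runAndWait()

translate_clip("restaurant_clip.wav")   # hypothetical file name
```

None of that is the hard part, of course - the hard part is deciding which voice in the room to feed into step 2, which is what the diarization discussion above is about.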
I am, of course, assuming apple's architecture is a rough approximation of above. An alternative theoretical architecture might resemble the one below, but I have seen no evidence apple is doing this.
audio goes into mic => direct audio-to-audio translation model => audio comes out of speaker
From what I understand, it is the STT engine that is the issue - and is in fact not a solved problem at all. Specifically, in a conversation where the microphones hear 3 people talking, 1 of them talking _at_ us, we need to pick out _that_ person only to translate.
If we were using Whisper in that pipeline, we could for example generate speaker embeddings for each segment Whisper generates, then group segments by matching the speaker embeddings (a rough sketch of that approach is below) - and in reality, this doesn't really work all that well.
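Something like this is what that naive approach looks like in practice - Whisper segments, a speaker-embedding model, and plain clustering. The specific libraries (resemblyzer, scikit-learn) and the assumption that we magically know there are 3 speakers are mine, purely for illustration:

```python
# Naive "diarization by clustering": transcribe with Whisper, embed each
# segment's audio with a speaker encoder, then cluster the embeddings.
import numpy as np
import whisper
from resemblyzer import VoiceEncoder, preprocess_wav   # pip install resemblyzer
from sklearn.cluster import AgglomerativeClustering

SR = 16_000  # resemblyzer's working sample rate

def cluster_speakers(path_to_wav: str, n_speakers: int = 3) -> None:
    wav = preprocess_wav(path_to_wav)                   # mono float32 @ 16 kHz
    segments = whisper.load_model("base").transcribe(path_to_wav)["segments"]

    encoder = VoiceEncoder()
    embeds = np.stack([
        encoder.embed_utterance(wav[int(s["start"] * SR):int(s["end"] * SR)])
        for s in segments
    ])

    # Assume we somehow know there are 3 speakers; a real system would have
    # to estimate this, which is part of why the naive approach falls over.
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeds)
    for seg, who in zip(segments, labels):
        print(f"speaker {who}: {seg['text'].strip()}")
```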
But we are still left with the question of _WHO_ to feed to the translation model - ideally, the person facing us or talking at us - so we'd have to classify the 3 people, all talking to each other, by their angle in relation to the listener's head, etc. This is what the diarization model would have to do - and the more sophisticated diarization models certainly could make use of precise angle input, which can only be computed if you have super-close timings (the back-of-the-envelope geometry is sketched below).
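To put numbers on why the timings have to be that tight: with a two-mic far-field model, the arrival angle comes straight out of the TDOA, so clock error between the earbuds converts directly into angle error. The 0.15 m spacing and the 30-degree "facing me" cone below are assumed, illustrative values, not anything from Apple:

```python
# Back-of-the-envelope: far-field angle of arrival from a two-mic TDOA.
# Mic spacing and the "facing me" threshold are assumed, illustrative values.
import numpy as np

C = 343.0     # speed of sound, m/s
D = 0.15      # assumed ear-to-ear spacing of the two earbud mics, metres

def angle_from_tdoa(tau_s: float) -> float:
    """Arrival angle in degrees off the listener's nose, from TDOA in seconds."""
    return np.degrees(np.arcsin(np.clip(C * tau_s / D, -1.0, 1.0)))

def probably_talking_at_me(tau_s: float, cone_deg: float = 30.0) -> bool:
    # crude gate: keep only speakers within +/- 30 degrees of straight ahead
    return abs(angle_from_tdoa(tau_s)) < cone_deg

# Someone dead ahead gives tau = 0. Add 50 microseconds of clock error between
# the earbuds and the same person appears roughly 6-7 degrees off axis.
print(angle_from_tdoa(0.0), angle_from_tdoa(50e-6))
# Someone well off to the side (tau ~ 300 us) falls outside the cone.
print(probably_talking_at_me(300e-6))
```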