No UX needs to be perfect for everyone, but this doesn't sound trivial to make reliable.

First things that came to mind:

  - facial hair
  - getting people to learn to make bigger mouth movements and not mumble
  - we're constantly self-correcting our speech as we hear our voice. This removes the feedback loop.
  - non english languages (god forbid bilingualism)
  - camera angles and head movement
And that's just from thinking about it for 30 seconds. I'm sure there are some really good use cases, but will any research group/company push through for years and years to make it really good even if the response is lukewarm?

>non english languages (god forbid bilingualism)

In my experience, any combination of computers + speech + Danish has, so far without exception, been terrible. Last time I tested ChatGPT, it couldn't understand me at all. I spoke both in my local dialect and as close to Rigsdansk [π] as I could manage. Unusable performance, and in any case I should be able to talk normally, or there's no point. That was about a year ago - it may have improved since, but I doubt it. I'm completely done trying to talk to machines.

Pre-emptive kamelåså: https://www.youtube.com/watch?v=s-mOy8VUEBk

[π] https://en.wikipedia.org/wiki/Danish_language#Dialects