This is super interesting. But I have to wonder how much it costs on the back end: it sounds like it's essentially just running a boatload of specialized agents, constantly, throughout the whole interaction (and with super-token-rich input for each). Neat for a demo, but what would it cost to run this for a 30-minute job interview? Or a 7-hour deposition?
Another concern I’d have is bias. If I am prone to speaking loudly, is it going to say I’m shrill? If my camera is not aligned well, is it going to say I’m not making eye contact?
So the conversational agent already runs on a provisioned chunk of compute, but that chunk isn't utilized at 100% of its capacity. The perception system takes advantage of the spare compute left over on what's provisioned for the top-level agent, so turning this on costs nothing "extra".
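To give a rough feel for that opportunistic scheduling, here's a toy sketch in Python. Everything in it is made up for illustration (the names `current_utilization` and `run_perception_tick`, the thresholds, the random metric), not our actual scheduler:

```python
import random

PROVISIONED_CAPACITY = 1.0  # normalized size of the already-paid-for compute chunk
HEADROOM_THRESHOLD = 0.2    # minimum idle fraction worth scheduling perception into

def current_utilization() -> float:
    # Stand-in for querying the real scheduler / GPU metrics.
    return random.uniform(0.3, 1.0)

def run_perception_tick(frame_id: int) -> None:
    # Stand-in for one pass of a specialized perception agent.
    print(f"perception ran on frame {frame_id}")

def perception_loop(num_frames: int) -> None:
    for frame_id in range(num_frames):
        spare = PROVISIONED_CAPACITY - current_utilization()
        if spare >= HEADROOM_THRESHOLD:
            # Rides on compute the top-level agent has already provisioned.
            run_perception_tick(frame_id)
        # Otherwise skip this frame; the conversational agent keeps priority.

if __name__ == "__main__":
    perception_loop(10)
```

The point of the sketch is just that perception work only ever fills headroom on the existing chunk, so the marginal cost of enabling it is zero; the tradeoff is that perception frames get dropped whenever the conversational agent is busy.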
Bias is a concern for sure, though it adapts to your speech patterns and behaviors over the course of a single conversation. So for flagging you as not making eye contact because, say, your camera is on a different monitor: it'll make that mistake once and not refer to it again.
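As a toy sketch of that one-mistake-then-adapt behavior (class and method names here are purely illustrative, not the real system's API):

```python
class SessionPerception:
    def __init__(self) -> None:
        # Cues this session has learned are unreliable for this particular user.
        self.suppressed: set[str] = set()

    def observe(self, signal: str) -> str | None:
        if signal in self.suppressed:
            return None  # adapted: this cue is no longer reported this session
        return f"flag: {signal}"  # the one-time mistake surfaces here

    def mark_unreliable(self, signal: str) -> None:
        # Context showed the cue doesn't apply, e.g. camera on another monitor.
        self.suppressed.add(signal)

session = SessionPerception()
print(session.observe("no_eye_contact"))   # flag: no_eye_contact  (mistake, made once)
session.mark_unreliable("no_eye_contact")
print(session.observe("no_eye_contact"))   # None  (not referred to again)
```

The suppression state lives with the session, so a miscalibrated cue stops being reported for that conversation without changing behavior for anyone else.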