1 & 2 are totally dependent on the company being willing to let their agents do things that they haven’t traditionally let humans do. For example, issuing refunds, or doing things that cost money but generate goodwill. I am skeptical that companies will be OK with their agents doing those things of their own volition.

3. Cool, so the user didn’t indicate whether they were satisfied. What then?

4. You can’t use a SOTA reasoning model right now; there’s too much latency for a conversation. So you’re either using an older, significantly less capable model, or you’re paying through the nose for fast mode. If the former, then you can’t trust the agent to do the right thing (see points 1 & 2). If the latter, there’s no cost savings over a human. So which is it?

1 & 2 are already happening; these startups take on the brand liability and trust to do so.

3 depends on how companies want to measure it, but a user not submitting a satisfaction score is not a good thing.

For 4, you can use a model without reasoning, plus various tricks to simulate low latency.

At the end of the day the company is going to audit what the agent has done. If the agent issues too many refunds that's a major red flag for the company providing the agent and likely results in the contract being terminated. I don't see how anyone can underwrite what agents are going to do today given that they're still so susceptible to prompt injection.

You didn't address my concern: non-reasoning models are so, so variable in their output.

1. Part of the moat is their guardrails, and obviously they are audited and tracked. There are agents issuing refunds and more at scale right now, so I'm not sure where the skepticism comes from. You're free to try to jailbreak them.

2. Another part of the value prop of these companies is figuring out how to construct the proper harness: taking advantage of the lower latency of faster models while shoring up their weaker intelligence, blending deterministic and non-deterministic behaviors, handling compliance, etc.

It's a hard problem, which is why the F500 is willing to pay up.
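
To make the "blend deterministic and non-deterministic behaviors" point concrete, here's a minimal sketch in Python. It assumes a hypothetical setup where the model only proposes an action and plain code decides whether it runs. The names (`ProposedAction`, `review_action`, `REFUND_CAP_USD`) and the thresholds are made up for illustration; this is not any vendor's actual harness.

```python
from dataclasses import dataclass

# Hypothetical policy limits; real values would come from the client's policy.
REFUND_CAP_USD = 50.00
ALLOWED_ACTIONS = {"refund", "send_faq", "escalate"}


@dataclass(frozen=True)
class ProposedAction:
    """What the (non-deterministic) model suggests doing."""
    kind: str
    amount_usd: float = 0.0


def review_action(action: ProposedAction, order_total_usd: float) -> str:
    """Deterministic gate: returns 'execute' or 'escalate'.

    The model never executes anything itself; this function decides.
    """
    if action.kind not in ALLOWED_ACTIONS:
        return "escalate"  # unknown action, possibly an injection attempt
    if action.kind == "refund":
        if action.amount_usd <= 0:
            return "escalate"  # malformed proposal
        if action.amount_usd > order_total_usd:
            return "escalate"  # can't refund more than was paid
        if action.amount_usd > REFUND_CAP_USD:
            return "escalate"  # above the cap: a human decides
    return "execute"


if __name__ == "__main__":
    print(review_action(ProposedAction("refund", 20.0), order_total_usd=35.0))
    print(review_action(ProposedAction("refund", 500.0), order_total_usd=35.0))
    print(review_action(ProposedAction("wire_transfer", 10.0), order_total_usd=35.0))
```

The idea is that however variable the model's output is, the worst case is bounded by the deterministic checks, and everything that gets escalated or executed can be logged for the audit.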

I’m curious where you see models like Codex-Spark fitting into this problem. I know they’re too expensive and availability is too limited right now, but in a few years…

Yes, you could. Not everything needs to be real time; you sometimes listen to the music for 30+ minutes anyway.
