Yeah, you've exactly captured one of the main problems with the model being relentlessly proactive: it will happily burn like $5 of tokens to avoid asking the human to take a screenshot or click a button for it.
I think providing proper token-efficient tools for agents will become even more important now.
I'm actually very happy about this. Babysitting the agent just in case it needs me to do something is a terrible use of my time. I've always had to be very explicit about the various ways that it can get an automated feedback loop going to check its work, and now Fable doesn't even need that hand holding. Really great improvement all around.
Have you ever wondered whether this would end up costing more than a competent offshore developer with a more frugal harness/model?
You still need a competent developer for the prompting, planning, etc. But once it's running, I want to avoid mental context switches and just have it run.
Giving it access to a cheap human who is just there to take screenshots, do QA, and give UX feedback sounds like a good idea in principle. It's non-trivial to set up, but I wouldn't be surprised if this becomes a thing at some companies. The return of the QA department, except that they now get to do the agent's bidding in addition to checking whether the results work.
Have you tried instructing it not to do that? Something like "do not branch into side projects or hacky solutions to obtain information you could ask me for. For example: if you need a screenshot of the issue, just ask me to take a screenshot rather than find a way to reproduce and screenshot it."
I used to complain about all the levels of indirection of modern software, running in a javascript jit, in a browser container, in a vm, on an os, etc.
I eventually just accepted it, but this new agent layer really takes things to a new level.
Ha, you just gave me an idea. Add to the prompt “do not do things that will burn over X tokens if the human operator can do them in under Y minutes; ask for it instead”.
I wonder if LLMs can estimate effort in tokens?
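For what it's worth, the token-versus-minutes rule suggested above is easy to sketch as a check the harness could run before the agent starts a detour. This is only a rough sketch: `Task`, `should_ask_human`, the token price, the hourly rate, and the two-minute cap are all made-up placeholders, and it assumes the agent can produce a usable token estimate at all, which is exactly the open question.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical record of a detour the agent is considering."""
    est_tokens: int        # agent's own guess at tokens needed (may be unreliable)
    human_minutes: float   # guess at how long the human would need instead


def should_ask_human(
    task: Task,
    usd_per_million_tokens: float = 15.0,  # assumed price, not a real quote
    human_usd_per_hour: float = 60.0,      # assumed value of the operator's time
    max_human_minutes: float = 2.0,        # only interrupt for quick favors
) -> bool:
    """Return True when asking the operator is cheaper than doing it in tokens."""
    token_cost = task.est_tokens / 1_000_000 * usd_per_million_tokens
    human_cost = task.human_minutes / 60 * human_usd_per_hour
    # Ask only if the favor is quick AND the token bill would exceed the human's time.
    return task.human_minutes <= max_human_minutes and token_cost > human_cost


# A screenshot the human can take in 30 seconds vs. a 400k-token reproduction.
screenshot = Task(est_tokens=400_000, human_minutes=0.5)
# A cheap lookup the agent can do itself.
lookup = Task(est_tokens=20_000, human_minutes=0.5)

print(should_ask_human(screenshot))
print(should_ask_human(lookup))
```

The cap on human minutes is there so the agent doesn't start delegating long chores just because tokens are expensive; whether the estimates feeding it are any good is a separate problem.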
I just say "if you need something specific or have any questions, stop and ask me for it".
Honestly, Claude straight up ignores my input sometimes. It prefers to run commands and process the output itself, burning through a pile of tokens while thinking hard about whether to ignore me.
Like today, I told Claude the exact name of the folder it had gotten wrong (it was supposed to be prod, not production), and it disregarded my input and went to examine the directory itself. It's a small example of the kind of thing it's been doing lately, but it's the one that's top of mind.
Almost as if this was _intentional_... maybe related to Anthropic still not being profitable and burning thru wads of cash every day.
The conspiracy theorist in me says that LLM providers do this regularly (or at least, don't bother optimizing for it) beyond some arbitrary "$/task" metric. I am not sure if there is enough SOTA model competition to avoid this.