Maybe I’m naive but the longest single workflow I ran was maybe 15 minutes. How do you steer agents to run “overnight”? And what is the quality of such execution?
I'm building https://engine.build
It's meant for implementing well-defined tasks/specs while orchestrating a review/fix/verify loop.
Every day I have implementations running for hours non-stop; it's simply the time it takes to get a proper, well-reviewed implementation with LLMs, imo.
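For anyone wondering what a review/fix/verify loop looks like in practice, here is a minimal sketch in Python. It is not how engine.build works internally (I don't know that); `implement`, `review`, `fix`, and `verify` are hypothetical callables standing in for agent calls, and `max_rounds` is an assumed safety cap.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    passed: bool
    rounds: int
    history: List[str] = field(default_factory=list)


def run_loop(
    implement: Callable[[], str],
    review: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    verify: Callable[[str], bool],
    max_rounds: int = 5,
) -> LoopResult:
    """Hypothetical orchestration: implement once, then review/fix/verify
    until the work is verified or the round budget runs out."""
    work = implement()
    history: List[str] = []
    for round_no in range(1, max_rounds + 1):
        issues = review(work)
        if issues:
            # Reviewer found problems: fix them and go around again.
            history.append(f"round {round_no}: {len(issues)} issue(s)")
            work = fix(work, issues)
            continue
        if verify(work):
            history.append(f"round {round_no}: verified")
            return LoopResult(True, round_no, history)
        # Review was clean but verification failed: treat that as an issue.
        history.append(f"round {round_no}: verification failed")
        work = fix(work, ["verification failed"])
    return LoopResult(False, max_rounds, history)
```

Each round is at least one slow model call plus whatever the verification step costs, which is why the wall-clock time adds up to hours.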
Usually coding where the closed loop evaluation takes time.
E.g. code debugging
This. Very few people are doing this right now (probably because it sucks having 5 copies of your app running in parallel on your laptop), but in the past few months models have gotten really good at testing your running app live. If you have an environment where you can run your full app and models can get at it via Playwright and Chromium, they can click around, take actions, and actually verify that their code works.
With boxes.dev I've started pushing agents harder to run the full app, test their work end to end, and send me screenshots as proof. This takes time, sometimes up to 30-40 minutes, but the result is much more likely to be bug-free at the end of the day.
Works well for very well-defined tasks. If you have a really big feature like a front-end migration, you can use /plan and /goal, which I think are in most harnesses. You can also use other tools that let your agent interact with other terminals (I use an ADE called orca, which has an orca skill where an agent can spin up different sessions; these are different from subtasks because they share the context and you can choose the harness/model, unlike sub-agents). It can also read from the terminal, use your browser or computer, take screenshots, and afterwards prepare a report or something.
To add to what @nab said, the longest ("overnight") runs are usually after going back and forth to build out a big multi-phase plan doc -- especially when each phase has an extensive manual test plan (agent runs the app in a browser, clicks through the workflow, watches logs, confirms behavior, etc).
These can go for many hours from all the manual testing and debugging. Quality really depends on how much you spec things out beforehand, and how you define the test plan / "success" gates. If the agent can't even run the app to test it then things can definitely go off the rails!
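A rough sketch of the "success gates" idea in Python, in case it helps. Everything here is hypothetical naming (`Phase`, `run_plan`, the gate callables); in a real run each gate would be something like "the app starts" or "the browser walkthrough passed", checked by the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    # Each gate is a zero-argument check that returns True on success.
    gates: List[Callable[[], bool]]


def run_plan(phases: List[Phase]) -> Optional[str]:
    """Run phases in order; return the name of the first phase that
    fails its gates, or None if every phase passes."""
    for phase in phases:
        if not phase.gates:
            # A phase with no way to test it counts as a failure here,
            # since that is where runs tend to go off the rails.
            return phase.name
        if not all(gate() for gate in phase.gates):
            return phase.name
    return None
```

The design choice is to stop at the first failed gate instead of pushing on, so a broken early phase doesn't get buried under hours of later work.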
I think they are just bullshitting.
In codex, if you use /goal it can go for a while. I've never seen overnight, but > 1 hr is common.
"build me a 10 million dollar MRR saas, make no mistakes"