Yes I'm still not really understanding this "run agents overnight" thing. Most of the time if I use claude it's done in 5-20 minutes. I've never wanted to have work done for me overnight...tomorrow is already plenty of time for more work, it's not going anywhere, and my employer isn't paying me to produce overnight.
The only counter I have to this is that there are some workflows that have test environments; everything can't or shouldn't just run locally. Sometimes these tests take time, and instead of babysitting the model to write code and run the build+deploy+test manually, you can send it off to work until the kinks are worked out.
Add to that I have worked on many projects that take more than 20 minutes to fully build and run tests... unfortunately. And I would consider that part of the job of implementing a feature, along with reducing the number of cycles I have to take.
After the "green" signal I will manually review or send off some secondary reviews in other models. Is it wasteful? Probably. But it's pretty damn fun (as long as I ignore the elephant in the room).
Yeah, our basic integration test suite takes over 20 minutes to run in CI, likely longer locally, but I never try to run the full test suite locally. That doesn't even include PDVs and other continuous testing that runs in the background.
The other day, I wrote a claude skill to pull logs for failing tests on a PR from CI as a CSV for feeding back into claude for troubleshooting. It helped with some debugging but was very fraught and needed human guidance to avoid going in strange directions. I could see this "fix the tests" workflow instrumented as overnight churn loops that are forbidden from modifying the test files, with engineers reviewing in the morning whether more tests pass.
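For anyone curious what that log-to-CSV step looks like, here's a minimal sketch. The real skill pulled from our CI, but assuming your CI can emit a JUnit-style XML report (most runners can), the extraction is roughly this; the function name and truncation limit are made up:

```python
import csv
import xml.etree.ElementTree as ET

def failing_tests_to_csv(junit_xml_path, csv_path):
    """Extract failing test cases from a JUnit-style XML report into a CSV
    that can be fed back into an agent prompt for troubleshooting."""
    tree = ET.parse(junit_xml_path)
    rows = []
    for case in tree.iter("testcase"):
        failure = case.find("failure")
        if failure is not None:
            rows.append({
                "test": f'{case.get("classname", "")}.{case.get("name", "")}',
                "message": (failure.get("message") or "").strip(),
                # truncate long stack traces so the CSV stays promptable
                "log": (failure.text or "").strip()[:2000],
            })
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["test", "message", "log"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Keeping only the failure message plus a truncated trace per test, rather than the whole CI log, is what made it fit in context across repeated attempts.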
Maybe agentic TDD is the future. I have a bit of a nightmare vision of SWEs becoming more like QA in the future, but with much more automation. More engineering positions may become adversarial QA for LLM output. Figure out how to break LLM output before it goes to prod. Prove the vibe coded apps don't scale.
In the exercise I described above, I was just prompt churning between meetings (having claude record its work and feeding it to the next prompt, pulling test logs in between attempts), without much time to analyze, while another engineer on my team was analyzing and actually manually troubleshooting the vibe coded junk I was pushing up. Between claude and some human(s) in the loop, we fixed over 100 failing integration tests in a week for a major refactor. I do believe it got things done faster than we would have finished without AI. I do think the quality is slightly lower than it would have been if we'd had 4 weeks without meetings to build the thing, but the tests do now pass.
Yes that's fair, but not the case for me. Everything can run locally and specs run quickly for covering things claude changes. For everything else, the GitHub CI run is 10-15m and catches any outlier failures, and I'm usually working on more than one thing at a time anyway so it doesn't really matter to wait for this.
It's for when you want to write a spec like "make me a todo list app", tell your agent of choice to go have fun, and return in the morning to a fully finished app without caring what the code is actually doing.
I’ve been playing around with these kinds of prompts. My experience is that the prompts need a lot of iteration to truly one-shot something that is halfway usable. If it’s under-spec’d it’ll just return after 15-20 minutes with something that’s not even half baked. If I give it an extremely detailed spec it’ll start dropping requirements and then finish around the 60-70 minute mark, but I needed 20 minutes to write the prompt and I need to hunt for the things it didn’t bother to do.
I’ve gotten some success iterating on the one-shot prompt until it’s less work to productionize the newest artifact than to start over, and it does have some learning benefits to iterate like this. I’m not sure if it’s any faster than just focusing on the problem directly though.
The dropping requirements problem is real. What's helped us is breaking the spec into numbered acceptance criteria (ACs) and having the verification run per-criterion. If AC-3 fails you know exactly what got dropped.
I'll try that out, thanks for the tip!