1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".
We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.
This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency in its (Opus 4.6 and GPT 5.4, IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but it seems to have the same problems.
We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that case, store the result in an array, write the results to a file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.), because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism at the right place.
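For the curious, a minimal sketch of that shape, assuming `claude -p` for the per-case model call (the directory layout, prompt, and result fields here are invented):

```python
import json
import subprocess
from pathlib import Path

# Hypothetical layout: one requirements file per test case.
REQUIREMENTS_DIR = Path("requirements")
RESULTS_FILE = Path("results.json")

results = []
for req_file in sorted(REQUIREMENTS_DIR.glob("*.md")):
    # The model only ever sees one test case at a time; the loop,
    # ordering, and bookkeeping are plain Python.
    prompt = (
        "Determine whether the application meets the requirements in "
        "the following file. Answer PASS or FAIL with a reason.\n\n"
        + req_file.read_text()
    )
    proc = subprocess.run(["claude", "-p", prompt],
                          capture_output=True, text=True)
    results.append({"file": req_file.name, "output": proc.stdout.strip()})
    # Persist after every case so a crash loses at most one result.
    RESULTS_FILE.write_text(json.dumps(results, indent=2))
```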
I used to assume they pushed people into the prompt-only workflows because you’re paying them for the tokens, and not paying them for the scaffolding you built. However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it. I do think it’s going to increase productivity enough to disastrously affect the developer job market and pay scale, but I don’t think this particular version of this particular technology is going to do what they say it will. If they said they were spending this much money bootstrapping a super useful thingy that can reduce a big chunk of the busy work of a human dev team (what most developers really want, and most executives really don’t), a bunch of investors would make them walk the plank.
I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
> However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects
You can have the AI design the custom harness in advance. It's not especially hard work! In fact, the AI could even come up with the workflow itself; it's a different and much simpler problem than trying to stick to it after-the-fact, with a filled-in context.
> However, I think what they’re really worried about is that a person needs to design and implement that stuff… It throws a wet blanket on their insistence that this will replace entire people in entire workflows or even projects, and I just don’t buy it.
I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break the work up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to, or better than, the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
And then you realize that what you’re using the smaller models for is ALSO decomposable, and part of it is just a few if statements. And then you realize that for this feature you don’t actually need or want a model at all, because plain code is cheaper and better for you and your users in performance, reliability, and reproducibility.
So you have the model write the if statements and put itself out of a job.
Indeed. I've been experimenting with agent workflows for complicated tasks, where I essentially have a graph of agents with different roles/capabilities, including things like breaking down complex tasks into simpler ones. There seems to be a point where a complex enough task is better performed by a group of cheaper agents/models than by one agent using one of the SOTA big models, in terms of both quality and cost.
The SOTA big models win in world knowledge, that's what all those parameters are for. But a huge fraction of agentic tasks are going to be plain clerical work that needs no special knowledge at all, a much simpler model can do them in a straightforward way.
It is also interesting because you get people with very different use cases arguing about the effectiveness of various models but doing very different things with them.
It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and add a button connected to it on the frontend, vs. "Design me a YouTube". The smaller models can pretty dependably do the first and fall flat on the second.
Aren't they just buying time to build you whatever harness you need? They want to be the only software engineering shop in the world.
Designing and implementing a code harness in your workflow can be as simple as running something like /skill-builder.
You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, using for example claude -p for the LLM call.
You can build this in 10 minutes.
I don’t use a cloud platform, so I can’t comment on that part. I’d say just run it on your own hardware, it’s probably cheaper too.
Google ADK might be useful, especially since v2 reorients it around graph operators for control flow.
Your specific case is covered in the v2 docs by an operation that fans out to many parallel tasks and then joins the results.
Secret: "compile" that orchestration prompt. Determinism is solved by turning prompts into code that can in turn run agents or run code or both.
Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
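A minimal sketch of that pattern; the file names and review prompt here are invented, and the SKILL.md only has to say "run `python skill.py <dir>`" so the loop lives in code, not in a prompt:

```python
# skill.py, shipped next to SKILL.md.
import subprocess
import sys
from pathlib import Path

def check_file(path: Path) -> str:
    # The LLM is invoked once per file, only for the judgment call;
    # iteration order and coverage are guaranteed by the script.
    out = subprocess.run(
        ["claude", "-p", f"Review this file for issues:\n\n{path.read_text()}"],
        capture_output=True, text=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    for f in sorted(Path(sys.argv[1]).glob("*.md")):
        print(f"## {f.name}\n{check_file(f)}\n")
```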
Exactly this; I tend to work this way. I built an ingestion pipeline that pulls concepts out of a novel using Qwen and pushes them into falkordb this way.
Could you elaborate on what "compiling the orchestration prompt" means?
When you get some abstraction working, you concretize it in something deterministic, or sort of “cache” that bit of knowledge (aka: write me a function, class, library, whatever). In the future, the nondeterministic path has a deterministic piece to lean on as it explores the problem space. Rinse, repeat, and eventually you have a mostly deterministic system. Leave flexibility in the spaces where you need that nondeterminism.
Rather than telling the LLM "loop through these files", tell it "write a script to loop through these files", then hard-code that script somewhere.
a guess but i think they mean take the orchestration prompt and prompt yet another llm to turn that prompt into code..?
> But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.)
couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
definitely a worse solution: non-deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms), but it could work in some cases
I saw a major uplift in performance after I combined tools like apply_patch with check_compilation & run_unit_tests. I still call the tool "apply_patch", but it now returns additional information about the build & tests if the patch succeeds. The agent went from ~80% success rate to what seems to be deterministic (so far). I don't bother to describe the compilation and unit testing processes in my prompts anymore. All I need to do is return the results of these things after something triggers them to run as a dependency.
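Roughly that combined-tool shape, as a sketch; `git apply`, `make build`, and `make test` are stand-ins for whatever the real patch/build/test commands are:

```python
import subprocess

def apply_patch(patch: str) -> dict:
    """Apply a patch, then piggyback compilation and test feedback on the
    result, so the agent never has to decide to run them separately."""
    apply = subprocess.run(["git", "apply", "-"], input=patch,
                           capture_output=True, text=True)
    if apply.returncode != 0:
        return {"ok": False, "stage": "apply", "detail": apply.stderr}
    build = subprocess.run(["make", "build"], capture_output=True, text=True)
    if build.returncode != 0:
        return {"ok": False, "stage": "compile", "detail": build.stderr}
    tests = subprocess.run(["make", "test"], capture_output=True, text=True)
    return {"ok": tests.returncode == 0, "stage": "test",
            "detail": tests.stdout + tests.stderr}
```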
I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT 5.4 base model.
I have but one upvote, but yes. The only way to make these systems work reliably is to break the problems down into smaller chunks. Any internal consistency checks are just going to show you that LLMs are way less consistent than you’d expect.
I had to create a hypothesis-testing agent: it gets a query like "is manufacturing parameter x significantly different this month than last month?" and follows a flowchart to run a statistical test and return the answer.
At the time I had access only to 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in the prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
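Something like this sketch, where the loop owns the flowchart and the model only ever sees the next step (`ask_model` is a hypothetical single-call helper, and the steps are invented):

```python
# The harness, not the model, owns the control flow.
FLOWCHART = [
    "Load this month's and last month's values for the parameter.",
    "Check both samples for normality.",
    "Run the appropriate two-sample test (t-test or Mann-Whitney).",
    "State whether the difference is significant at alpha = 0.05.",
]

def run_flowchart(ask_model, query: str) -> str:
    context = f"Question: {query}"
    for step in FLOWCHART:
        # Feed exactly one step at a time; the agent can't skip ahead
        # or wander off the flowchart.
        context += "\n\nNext step: " + step
        context += "\n" + ask_model(context)
    return context
```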
Wouldn't it be more efficient to convert the requirements in these 200 markdown files into Playwright tests?
You could still use an LLM to write and extend the tests, but running the tests would be deterministic and would use less resources.
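For example, with Playwright's Python bindings (via pytest-playwright), one requirements file might compile down to a test like this; the URL, selectors, and requirement ID are invented:

```python
# Hypothetical test derived from one requirements file, using the `page`
# fixture provided by pytest-playwright.
def test_req_042_login_lands_on_dashboard(page):
    page.goto("https://app.example.com/login")
    page.fill("#email", "qa@example.com")
    page.fill("#password", "correct-horse")
    page.click("button[type=submit]")
    # REQ-042: after a successful login the user lands on the dashboard.
    page.wait_for_url("**/dashboard")
```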
This type of thing so much.
AI is being pushed so hard at work right now, even for non-dev stuff. The number of things that people react to with "awesome, never seen this before" is staggering.
Just because you haven't seen file format X converted to file format Y before, and now you asked the LLM to do it and it worked, doesn't mean you needed an LLM for it, nor that it's remarkable. The LLM knew how to do it because it learned from a bazillion online sources for deterministic converters that cost nothing (and are open source). But now you're paying, every single time, for a non-deterministic version of it, and you find it cool. It's magic ...
But I guess they deserve it.
> It's magic
you'd be surprised how many people are comfortable attributing something they do not understand to Magic.
more than anything, ai lets people who couldn't write simple code, and wouldn't bother to learn, sidestep the ones who can, and build solutions to scratch their own itch. faster, too.
now human behavior kicks in: they don't want to hand control back to the people who can code to solve problems.
put this together and you have a good model for understanding the AI sales pitch... it's magic.
like all magic, it's but a trick.
Oh, yes! As someone who has dabbled in card tricks: this, so much. People don't understand how it's done, and can't imagine or conceive of a way it possibly could be done, so they attribute it to literal magic or demons or whatever. Like, no, I just distracted you for a split second and used sleight of hand.
Technology is no different: someone has never even considered that this thing could be possible, and now they see it with their own eyes? Incredible! They don't realise that it's mundane and has been possible (in much cheaper ways) for a long time. It was like a few years ago when some journalist posted an animation showing how Horizon Zero Dawn does frustum culling, and all the non-tech people went "wow! This game unloads the game world when it's not in view! Incredible!"... like, yeah? That's how games have worked since the advent of 3D.
> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating.
Sorry, you thought a prompt was a suitable replacement for a testing suite?
hey man, it works great, barely, and also costs a bunch of money every time we run it. we also can't trust the results. relax.
If you are invested in AI stocks, this is the way. You are basically funneling money from software companies into your brokerage account. Keep going.
This was a great example, thanks
> This started breaking down after ~30 files.
Codex's short context and todo-list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo-list items haven't been done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup, and it finished without obvious issues.
Is Codex willing to do "multi hour" tasks when used with a ChatGPT Plus subscription, or does it need something more expensive like Pro?
It's going to work the same regardless of how much you pay, but with Plus you'll run into 5h usage limit rather quickly unless your "multi hour task" spends 90% of the time just waiting around for code to compile. Expect to get an hour or two of active work (single-threaded).
I regularly get Codex to do multi-hour tasks with a single prompt; I don't think that's a big deal anymore. But you don't want a single agent doing all the work. The root agent needs to delegate the work to subagents: for example, a subagent for context gathering, then one for planning, then one (or more) for implementation, then another for review. This way the root agent doesn't use up its context window and just manages from a bird's-eye view. I do have the $200 plan, though.
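As a rough sketch of that delegation shape, with a hypothetical `spawn_agent(role, prompt)` that runs a fresh model instance and returns only its final summary, so the root agent's context stays small:

```python
def run_task(task: str, spawn_agent) -> str:
    # Each call is a fresh subagent with its own context window.
    context = spawn_agent("context-gatherer",
                          f"Collect everything relevant to: {task}")
    plan = spawn_agent("planner",
                       f"Task: {task}\n\nContext:\n{context}\n\n"
                       "Return one implementation step per line.")
    results = [spawn_agent("implementer", step)
               for step in plan.splitlines() if step.strip()]
    # The root agent only ever sees summaries, never raw transcripts.
    return spawn_agent("reviewer",
                       "Review these changes:\n" + "\n".join(results))
```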
If you have an org email, you can get a free ChatGPT Plus subscription.
This might be inherent to how the models are benchmarked.
Aren’t some benchmarks giving the model multiple shots at a problem and only keeping the successful result if it appears, ignoring the failure rate?
Good point. We need the mean, the “any 1 of 10”, and the “all 10 of 10” success rates in the metrics; the last one is what lets us estimate reliability.
> We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.
This is cool. Can you elaborate on it? Is it flaky? Does it take a long time?
Isn't this already possible to implement with skills and subagents? Like, have a skill saying "to test these files, run this script that executes a subagent for every markdown file, then check the results".
I’m working on a hybrid system of old-school task graphs and AI agents, letting them instantiate each other. I think others will do that eventually.
I'm working on something similar (won't link to it, as I don't want people to think I'm spamming), but if you want to compare notes I'm happy to talk.
Jira for agents?
c.f. Linear for Agents
https://linear.app/agents
The other part of the question is exactly when the "build for the capabilities of future models" becomes the present.
Looking at the Mythos benchmarks, it doesn't seem like the models are that close to being truly reliable for agentic tasks.
Is it a year away, or five? That's a big difference in deciding what to build today.
Our team at Agentforce recently open-sourced our solution to this and we've gotten very valuable feedback -- would love to hear from more of you about it: https://github.com/salesforce/agentscript
No you didn't
"What we're not open sourcing (yet) is the runtime. "
You could have a skill that is the combination of a minimal markdown file and a set of orchestration scripts that do the deterministic work. The agent does not have to “run everything”; it just needs to know how to launch the right script.
So I wonder: could a more powerful agent harness have the agent basically write and execute its own deterministic code which, when run, spawns subagents for each of the subtasks?
So far we've seen agents spawn subagents directly, but that still means leaving the final flow control to the non-deterministic orchestrator model, and so your case is a perfect example of where it would probably fail.
I've been working on an integrated deterministic/agent system for a few months now. It basically runs an AI step to build a plan, which biases towards deterministic steps as much as possible but escalates back to AI when it needs to (for AI-only capabilities, or for deterministic failures). So effectively (when I perfect it; I'm about 90% there) it can bounce back and forth as needed, with deterministic steps launching AI steps and AI steps launching deterministic steps.
I'm probably not explaining it very well, but I think it's pretty effective at reducing token usage.
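If it helps, here's a guess at the shape being described: a plan as a list of steps, deterministic where possible, with AI steps as the escalation path (`run_ai_step` and the step schema are invented):

```python
def execute_plan(steps, run_ai_step):
    # run_ai_step(prompt) is a hypothetical call that returns follow-up
    # steps, so deterministic and AI steps can keep launching each other.
    queue = list(steps)
    while queue:
        step = queue.pop(0)
        if step["kind"] == "deterministic":
            try:
                step["fn"](**step.get("args", {}))  # cheap path, no tokens
            except Exception as err:
                # A deterministic failure escalates back to AI.
                queue[:0] = run_ai_step(f"Step failed: {err}. Recover.")
        else:
            # An AI step may emit more deterministic steps to run next.
            queue[:0] = run_ai_step(step["prompt"])
```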
I've been building a workflow engine for agent orchestration and the workflows are just data for the engine to execute. While I haven't experimented with it yet, I envision that an LLM would be rather good at generating the workflows based on a description of your needs (and context about how best to utilise the workflow engine).
LLMs are pretty good at reasoning about workflows; it's just that when they have to apply them directly, the workflow context gets muddled with your actual task's context. That's why using an orchestration agent that delegates work to worker agents works so much better.
I still think there's a huge amount of value in having the workflow executed in a deterministic way (as code, or by a workflow engine) because it saves tokens, eliminates any possibility of not following it, and unlocks other cool things, like being able to give each step in the workflow its own focused task-specific context, splitting plans into individual actions and feeding them through a workflow one by one, and having workflow-step specific verification.
But that workflow absolutely CAN be created by an LLM, it just shouldn't be executed by one.
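A tiny sketch of workflows-as-data: the step schema and tool names are invented, an LLM emits the JSON, and plain code executes it, so each step can get its own focused context:

```python
import json

# Hypothetical workflow document of the kind an LLM could generate:
# each step names a tool, and {placeholders} splice in earlier outputs.
WORKFLOW = json.loads("""
[
  {"id": "notes",   "tool": "read_file", "arg": "chapter1.md"},
  {"id": "summary", "tool": "llm",       "arg": "List the key concepts in: {notes}"},
  {"id": "done",    "tool": "save",      "arg": "{summary}"}
]
""")

def run_workflow(steps, tools):
    # Ordering, wiring, and verification all live in code; the model is
    # just another tool the engine can call.
    outputs = {}
    for step in steps:
        arg = step["arg"].format(**outputs)
        outputs[step["id"]] = tools[step["tool"]](arg)
    return outputs

# Usage: run_workflow(WORKFLOW, {"read_file": ..., "llm": ..., "save": ...})
```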
I make codex do everything through a giant `justfile`. Simple, greppable, self-documenting, works great, and I don’t even need to read it.
I never tell claude to "go over this bunch of files and do this".
I tell it "write a program that goes over this bunch of files and do this".
Sometimes "do this" can be invoking another claude instance.
I'm personally surprised by this too. Like, everyone is writing how insanely productive AI is making them, but that productivity doesn't seem to have translated into any innovations beyond model quality.
Like, most of the stuff needed to make AI better is stuff that could have been written by hand in 2015, so why hasn't anyone used their agents to do so?
To be fair, there is probably a way to make it work the way you want. You could add an MCP server for a task queue and let the model work through each item in the queue. The tasks could be added by a deterministic system, i.e. your harness.
Can you not have it write your harness for you, or have it be the first step? You can push your own determinism where you need, surely.
True. The prompt reads: Run the following Python: ```
[flagged]
From the site guidelines (https://news.ycombinator.com/newsguidelines.html):
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.