This is something I realised late last year while using Claude Code. The LLM shouldn't be the one in control of the workflow, because the LLM can make mistakes, skip steps, hallucinate steps, etc. It's also wasteful of tokens.
I'm a firm believer that a "thin harness" is the wrong approach for this reason, and that workflows should be enforced in code. Doing that lets you make sure the workflow is always followed, and it reduces tokens since the LLM no longer has to consider the workflow or read the workflow instructions. But it also allows more interesting things: you can split plans into steps and feed them through a workflow one by one (so the model no longer needs such strong multi-step instruction following); you can give each workflow stage its own context or prompts; and you can add workflow-stage-specific verification.
Based on my experience with Claude Code and Kilo Code, I've been building a workflow engine for this exact purpose: it lets you define sequences, branches, and loops in a configuration file that it then steps through. I've opted to pass JSON data between stages and to use the `jq` language for logic and data extraction. The engine itself is hand-coded Rust (the recent Claude Code bugs taught me that the core has to be solid), while the actual LLM calls are done in a subprocess (currently I have my own TypeScript + Vercel AI SDK harness, but the plan is to also support third-party ones like the Claude Code CLI, Codex CLI, etc., in order to be able to use their subscriptions).
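To make the shape of that concrete, here's a toy Python sketch of the stepping loop. None of this is the actual engine: the stage names are invented, plain Python callables stand in for jq expressions, and the "LLM call" is a pure function.

```python
import json

# Illustrative-only sketch of a stepping workflow engine: stages pass a
# JSON-serializable dict along, and each stage's "next" callable (a stand-in
# for a jq expression) picks the following stage. A loop falls out naturally:
# "execute" keeps routing back to itself until the step list is drained.
WORKFLOW = {
    "start": "plan",
    "stages": {
        "plan": {
            "run": lambda d: {**d, "steps": ["step 1", "step 2"]},
            "next": lambda d: "execute",
        },
        "execute": {
            # consume one plan step per pass; loop until the list is empty
            "run": lambda d: {**d, "steps": d["steps"][1:]},
            "next": lambda d: "execute" if d["steps"] else "done",
        },
        "done": {"run": lambda d: d, "next": lambda d: None},
    },
}

def run_workflow(wf, data):
    name = wf["start"]
    while name is not None:
        stage = wf["stages"][name]
        data = stage["run"](data)   # in a real engine: an LLM subprocess call
        name = stage["next"](data)
    return data

print(json.dumps(run_workflow(WORKFLOW, {"task": "demo"})))
```

The point of the sketch is only that the control flow lives in the harness, not in the model: the model would see one stage's prompt and data at a time, never the workflow itself.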
I'm not quite ready to share it just yet, but I thought it was interesting to mention since it aims to solve the exact problem that OP is talking about.
I've recently started to use skills and so far it's been working great.
Your agent can write a Python script to loop and simply call `claude -p` or `codex exec`.
For simple workflows this seems good enough and can be set up in 10 minutes without third party software.
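For reference, a minimal version of that driver script could look something like this (the task list and prompt wording are placeholders, and the runner is injectable so the sketch doesn't have to actually shell out):

```python
import subprocess

# Sketch of the "agent writes a driver script" idea: loop over tasks and
# shell out to `claude -p` (or `codex exec`) once per task.
tasks = ["fix the failing test", "update the changelog"]

def build_command(task: str) -> list[str]:
    # hypothetical prompt wording; adjust to taste
    return ["claude", "-p", f"Please do the following: {task}"]

def run_tasks(tasks, runner=subprocess.run):
    for task in tasks:
        # capture output so the wrapper can log or inspect each run
        runner(build_command(task), capture_output=True, text=True)

# run_tasks(tasks)  # would invoke the claude CLI once per task
```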
What do you think?
For simple workflows or once-off workflows, that's a good approach.
For long-running, repeatable workflows, though, I personally think a robust dedicated workflow engine is needed. That covers cases like leaving your agent running overnight, running the same workflows over and over in different projects, or more autonomous Devin-like workflows. It also covers wanting audit trails/observability; vetted workflows (i.e. not having the LLM write them, or having the LLM write them and reviewing them) without having to read through scripts; more complex requirements like different models/providers for different workflow stages, or the things I mentioned previously (context, plans, verification, etc.); and more complex workflow needs (swarms or fork/join, parallel pipelines, routing/branching, error recovery, etc.).
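To illustrate the routing/error-recovery point, here's a hedged Python sketch (stage names and the success/failure convention are invented, not taken from any real engine): each stage declares where to go on success and on failure, and the engine records an audit trail as it steps.

```python
# Invented example: a build/fix/test loop with explicit failure routing.
# "on_success"/"on_failure" name the next stage, or None to stop.
WORKFLOW = {
    "start": "build",
    "stages": {
        "build": {"on_success": "test", "on_failure": "fix"},
        "fix":   {"on_success": "build", "on_failure": None},
        "test":  {"on_success": None, "on_failure": "fix"},
    },
}

def route(wf, outcomes):
    """Step through stages; `outcomes` is a stub for real LLM/tool results
    (True = stage succeeded). Returns the trail of visited stages, which
    doubles as a simple audit trail."""
    name, trail = wf["start"], []
    results = iter(outcomes)
    while name is not None:
        trail.append(name)
        ok = next(results, True)
        stage = wf["stages"][name]
        name = stage["on_success"] if ok else stage["on_failure"]
    return trail

# e.g. the build fails once, the fix succeeds, then build and test pass:
route(WORKFLOW, [False, True, True, True])  # → ['build', 'fix', 'build', 'test']
```

This is the kind of policy that's tedious and error-prone to re-express in every ad-hoc script, but trivial for an engine to enforce uniformly.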
I think for most users running claude/codex for themselves on smallish projects it's unnecessary, but as you scale up, I feel that more powerful tools are needed. In corporate settings, repeatable workflows with audit trails, artefact management, and job-queue-based task management start becoming more important too.
I also suspect that using a workflow engine as an internal behind-the-scenes system in a GUI-centric vibe coding tool might help raise the ceiling compared to existing tools, though I've yet to test that hypothesis. Simply because it takes the mistakes out of the user's hands: the engine will follow proven workflows whether you ask it to or not, keeping skills for context/knowledge, not for orchestration.
Something else I've been experimenting with a little, but not enough yet to have an opinion, is small language models running locally for orchestration, and frontier models for doing work.