Snake oil. Good to read for sure. Seems all plausible too. But snake oil nevertheless.
Here's why: The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.
These harness approaches pretend that LLMs are strict, perfect rule followers and that the only problem is not being able to specify enough rules clearly enough. That's a fundamental misreading of how LLMs operate.
That leaves only one option, not reliable but more reliable nevertheless: human review and oversight. Possibly two rounds of it, one after the other.
Everything else is snake oil, but at that point you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.
Snake oil may be a bit strong, because snake oil never works (except maybe as placebo?) whereas anything with an LLM, even though stochastic, has a pretty high chance of working.
> ... you also realize that promised productivity gains are also snake oil because reading code and building a mental model is way harder than having a mental model and writing it into code.
Not really, though it depends on the code; reading code is a skill that gets easier with practice, like any other. This is common whenever you're in a situation where you're reading much more code than writing it (e.g. working with a large, sprawling codebase that existed long before you touched it).
What makes it even easier, though, is if you're armed with an existing mental model of the code, either gleaned from documentation, from past experience with the code, or from poking your colleagues.
And you can do this with agents too! I usually already have a good mental model of the code before I prompt the AI. It requires decomposing the tasks a bit carefully, but because I have a good idea of what the code should look like, reviewing the generated code is a breeze. It's like reading a book I've read before. Or, much more rarely, there's something wrong and it jumps out at me right away, so I catch most issues early. Either way the speed up is significant.
> has a pretty high chance of working.
For MVPs, mock-ups, prototypes, or in the hands of an expert coder. You can't let them go unsupervised. The promise of automated intelligence falls far short of the reality.
A pretty high chance isn't the intent or the impression the end user often has.
Indeed, and it is a complicated problem to solve. A GUI or CLI can hide footguns or make them less likely to be misused. But an AI agent is perfectly happy to use a wrecking ball to drive a nail without any second thought or confirmation.
Not only "has a high chance of working", but you can pay more to make it more reliable. It really is striking trying to run a harness openClaw thing on a smaller or quantised model, really makes you realise how much we take for granted from SOTA models that was totally impossible just a year ago, in terms of complex, generally reliable tool use.
Humans also regularly drop hard requirements you specify, and similarly require review. Nevertheless we manage to increase the reliability of human output through processes and reviews, and most of the methods we use for harnesses are taken from experience with how to reduce reliability issues in humans, who are notoriously difficult to get to deliver reliably.
The primary way to increase reliability is to automate: instead of humans producing some output manually, humans produce machines which produce that output.
I've seen a disturbing trend where a process that could've been a script or a requirement that could've been enforced deterministically is in fact "automated" through a set of instructions for an LLM.
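To make the contrast concrete (a purely hypothetical sketch, not anyone's actual setup): a rule like "never commit to main and keep files under 512 KB" doesn't need to live in AGENTS.md as a polite request; a few lines of pre-commit script enforce it deterministically:

    #!/usr/bin/env python3
    # Hypothetical pre-commit hook: the rule is enforced, not requested.
    # The branch names and size limit are made up for illustration.
    import subprocess, sys

    MAX_FILE_KB = 512

    def git_lines(*args):
        out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
        return out.stdout.strip().splitlines()

    def main():
        errors = []
        branch = git_lines("rev-parse", "--abbrev-ref", "HEAD")[0]
        if branch in ("main", "master"):
            errors.append("direct commits to main/master are not allowed")
        for path in git_lines("diff", "--cached", "--name-only"):
            try:
                size_kb = len(open(path, "rb").read()) / 1024
            except FileNotFoundError:
                continue  # file deleted in this commit
            if size_kb > MAX_FILE_KB:
                errors.append(f"{path} is {size_kb:.0f} KB (limit {MAX_FILE_KB} KB)")
        if errors:
            print("commit rejected:\n  " + "\n  ".join(errors))
            sys.exit(1)

    if __name__ == "__main__":
        main()

The LLM can still do the creative part of the work; the hard requirements just stop being suggestions.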
Sure, when that is possible. However, there are lots of processes we don't know how to automate in a deterministic way. Hence the vast amount of investment in building organisations of people, with mechanisms to make people's output more reliable through structure, reviews, and so on.
Large parts of human civilization rest on our ability to make something unreliable less unreliable through organisational structure and processes.
We resolve that through liability, penalties, trust, responsibility, review and oversight.
At the end of the day, if I am spending $X on automation, I want to be able to sleep at night knowing my factory will not build a WMD or delete itself.
If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Liability, penalties, trust, and responsibility do not directly affect reliability; they are means we use to try to influence the application of the processes that do. They can be applied just as much to a team using AI as to one that does not.
Review and oversight do address reliability directly, which is why we use them to improve the reliability of mechanical processes as well, and why they are core elements of AI harnesses.
> If it's simply a tool that is a multiplier for experts, then do I really need it? How much does it actually make my processes more efficient, faster, or more capable of earning revenue?
You can ask the same thing about all the supporting staff around the experts in your team.
> There is a LOT that is forgiven when tech is new - but at some point the shiny newness falls off and it is compared to alternatives.
Only teams without mature processes are not doing that for AI today.
Most of the deployments of AI I work on are the outcome of comparing it to alternatives, and they are often part of initiatives to increase the reliability of human teams just as much as to increase raw productivity, because the two are often one and the same.
Underrated comment.
So many applications of LLMs skip even starting with a deterministic approach before reaching for a non-deterministic LLM, and then people wonder why it's not working.
it's strange to see software engineers using skills, aka human descriptions of small scripts, instead of scripting things directly. often there have been CLIs / tools / libraries to do what a skill does for many years. maybe it's a culture issue: people who enjoy automation / devops / predictability will naturally help themselves, but other people just want to "delegate" and be done without trying.
[flagged]
Because certain aspects (both are error prone) are similar and comparable. The notion that two entities need to be close in abilities for it to be possible to compare them is nonsense.
You make the point for me: we managed to put men on the moon despite humans being enormously unreliable and error-prone, because we built systems around them that allowed for harnessing the good bits and reducing the failures to acceptable levels.
We are - I am anyway - using our lessons from building reliable systems from unreliable elements to raise the reliability of outputs of LLMs the same way.
> We are - I am anyway - using our lessons from building reliable systems from unreliable elements to raise the reliability of outputs of LLMs the same way.
:) :) :) I could tell immediately that you are somehow invested in the "success" of the LLM. So, 600 billion dollars and five years later, can you tell me how far you guys got? The Apollo programme cost a tiny fraction of that and started putting people on the moon some ~10 years later. Would you say that you are on the way to accomplishing something similar in the next five years?
Calm down. They were comparing a very specific and narrow aspect of both. Not totally equivalent maybe, but that doesn't justify a tantrum.
I am incredibly calm. I just wonder at the idiots who think they should compare the magnificently efficient human brain to the shitslop machines.
Everything you say is all possible, and in theory I agree with you.
However, I have been using spec-kit (which is basically this style of AI usage) for the last few months and it has been AMAZING in practice. I am building really great things and have not run into any of the issues you are talking about as hypotheticals. Could they eventually happen? Sure, maybe. I am still cautious.
But at some point, once you have personally used it in practice for long enough, you can't just dismiss it as snake oil. I have been a computer programmer for over 30 years, and I feel like I have a good read on what works and what doesn't in practice.
We can build all the scaffolding around them that we want, but I assure you the fundamental problem here is that LLMs aren't perfect rule-following machines, and that will remain.
Give it a few more months and I'm sure you'll see some of what I see if not all.
I'm saying all of the above having tried and tested all sorts of systems with AI, which is what leads me to say what I said.
I have been doing this for 6 months or so now, and I am not sure that even if you have a lot more experience than me, it would make your assessment more accurate, since that just means you have more experience with prior generations of the models. What I have experienced is that the AI has been getting better and better, and is making fewer and fewer mistakes.
Now, part of that is my advancements as well, as I learn how to specify my instructions to the AI and how to see in advance where the AI might have issues, but the advancements are also happening in the models themselves. They are just getting better, and rapidly.
The combination of getting better at steering the AI along with the AI itself getting better is leading me to the opposite conclusion from yours. I have production systems that I wrote using spec-kit that have been running in production for months and have been doing spectacularly. I have been able to consistently add the new features that I need to, without losing any cohesion or adherence to the principles I have defined. Now, are there mistakes? Of course, but nothing that can't be caught and fixed, and not at a higher rate than traditional programming.
> the fundamental problem here is that LLMs aren't perfect rule-following machines
I kind of get what you're saying, but let us not pretend that SW engineers are perfect rule followers either.
Having a framework to work within, whether you are an LLM or a human, can be helpful.
i think it depends on your goals and also your preferences / expectations for how your experience with LLMs goes. i don't mind if they hallucinate. even if i have a mental model of the code, i won't write it perfectly myself either.
the only downside i see is getting out of practice, which is why i don't use it for my passion projects. work is just work, and pressing 1 or 2 and having "good enough" can be a fine way to get through the day. (lucky me, i don't write production code ;D... goals...)
> Give it a few more months
By that time, they will have realized immense value before seeing some of what you see. Sounds like an endorsement of spec-kit.
> The slot machine can drop any hard requirement that you specify in your AGENTS.md, memory.md or your dozens of skill markdowns. Pretty much guaranteed.
Indeed. That said, I’ve had some success with agent skills, but I use them to make the LLM aware of things it can do using specific external tools. I think it is a really bad idea to use this mechanism to enforce safety rules. We need good sandboxing for this, and promises from a model prone to getting off the rails are not a good substitute.
But I have taught my coding agent to use some ad hoc tools to gather statistics from a directory containing experimental data, and things like that. Nobody is going to fine-tune an LLM specifically for my field (condensed matter physics), but using skills I can still make it do useful work. Like monitoring simulations where some runs can fail for various reasons and each time we must choose whether to run another iteration or restart from a previous point, based on eyeballing the results ("the energy is very strange, we should restart properly and flag for review if it is still weird", that sort of thing). I don’t give the agent too many rules; I just give it ways of solving specific problems that may arise.
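For illustration only, since my actual scripts are specific to my codes: one of those ad hoc tools can be as small as the sketch below, and the skill markdown just tells the agent when to run it and how to read its output (the file names, layout and the 10x heuristic are all invented here):

    # Hypothetical "check the runs" tool a skill could point the agent at.
    import sys
    from pathlib import Path

    JUMP_FACTOR = 10.0  # flag a run if |E| suddenly jumps by this factor

    def energies(run_dir: Path):
        vals = []
        for line in (run_dir / "energy.dat").read_text().splitlines():
            if line.strip() and not line.startswith("#"):
                vals.append(float(line.split()[-1]))
        return vals

    def main(root: str):
        for run_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
            if not (run_dir / "energy.dat").exists():
                print(f"{run_dir.name}: no output, run probably failed")
                continue
            e = energies(run_dir)
            if len(e) < 2:
                print(f"{run_dir.name}: too few samples, flag for review")
            elif abs(e[-1]) > JUMP_FACTOR * max(abs(e[-2]), 1e-12):
                print(f"{run_dir.name}: energy {e[-2]:g} -> {e[-1]:g}, restart and flag")
            else:
                print(f"{run_dir.name}: looks sane, continue iterating")

    if __name__ == "__main__":
        main(sys.argv[1] if len(sys.argv) > 1 else ".")

The agent doesn't decide what "strange" means; the script does, and the agent just acts on the report.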
I hope the only reason people are pretending these markdown suggestions are a "workflow" is fear that a more structured approach will be obsolete by the time it's polished. I can't imagine the pace of innovation with the underlying models will stay like this forever.
I hope to see harnesses that will demand instead of ask. Kill an agent that was asked to be in plan mode but did not play the prescribed planning game. Even if it's not perfect, it'd have to be better than the current regime when combined with a human in the loop.
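Roughly something like this, as a sketch only (the tool names and the Harness interface are invented; no real product exposes exactly this): the harness checks every tool call while the agent is supposed to be planning and aborts the run on the first write attempt, rather than reminding the model of the rules again:

    # Sketch of "demand instead of ask": hard-stop the agent on a plan-mode
    # violation instead of hoping it re-reads its instructions.
    READ_ONLY_TOOLS = {"read_file", "list_dir", "grep", "ask_user"}

    class PlanModeViolation(Exception):
        """Raised to terminate the run on the first rule break."""

    class Harness:
        def __init__(self, mode, dispatch):
            self.mode = mode          # "plan" or "execute"
            self.dispatch = dispatch  # callable that actually runs the tool

        def call_tool(self, tool_name, **args):
            if self.mode == "plan" and tool_name not in READ_ONLY_TOOLS:
                # No warning, no re-prompt: kill the run, hand it to a human.
                raise PlanModeViolation(f"'{tool_name}' attempted during plan mode")
            return self.dispatch(tool_name, **args)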
Don't let the perfect be the enemy of the good. Of course we know the AGENTS.md and skills aren't 100% effective. But no, it doesn't mean that they're 0% effective.
A slot machine isn't snake oil.
A slot machine gives you rewards when the stars align; snake oil never does :)
All this said, I quite like the mental model of documenting a simple process, and I suspect our future ai overlords will find it useful that I have a series of md files that outline my preferences and processes for certain tasks.
I am not however going to share any of this with work colleagues and make myself redundant.
I can see why this would seem to be “snake oil” logically. However, this approach does work in reality. Your comment just shows that you seem inexperienced with using generative AI.