I really enjoyed this article. I think the author is precisely right, and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit hiding in how we design our agent harnesses that can vastly improve the effectiveness of even currently existing models; enough to — at least until we hit diminishing returns — make as much of a difference as training new models, or more!
I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. If the harness, when improved, can make as much of a difference as improvements to the model itself, if not more, then the two really have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And then in practice, as they operate, it's a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think this way of thinking could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
My Weird Hill is that we should be building things with GPT-4.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Surely none of this is relevant today? Well, it turns out all of it still is! For example, the "grep def" trick (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
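For the curious, here's roughly what I mean by the "grep def" startup map, as a minimal Python sketch. The output file name and the way you wire it into your agent's startup hook are assumptions on my part, not any particular harness's API:

```python
# Minimal sketch of the "grep def" idea: build a one-shot map of where every
# function and class lives, so the agent can see the layout up front instead
# of grepping around. The output path and hook wiring are assumptions.
import re
from pathlib import Path

DEF_RE = re.compile(r"^\s*(def|class)\s+(\w+)")

def build_code_map(root: str = ".") -> str:
    entries = []
    for path in sorted(Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            m = DEF_RE.match(line)
            if m:
                entries.append(f"{path}:{lineno}: {m.group(1)} {m.group(2)}")
    return "\n".join(entries)

if __name__ == "__main__":
    # Dump the map somewhere a startup hook (or a static context file)
    # can prepend it to the agent's context.
    Path("code_map.txt").write_text(build_code_map())
```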
> My Weird Hill is that we should be building things with GPT-4.
Absolutely. I always insist that our developers test on older/slower machines. That gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
> semantic
> grep def
Once you get to a codebase beyond a certain size, that no longer works.
I, for one, have found Serena https://github.com/oraios/serena , which you can install from right within Claude, to be a fairly fantastic code-interaction tool for LLMs. It does both semantic search and editing, with way less token churn.
This is an interesting one - thanks for sharing!
The problem with these exercises is always: I have limited time and capacity to do things, and a fairly unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.
If I do things for the love of it, the rules are different, of course. But otherwise I simply accept that there are many things improving around me that I have no intimate knowledge of and probably never will; I let other people work them out and happily lean on their work to do the next thing I care about that isn't already solved.
Well it's an amusing exercise I suppose, if you're into that sort of thing. I certainly enjoy it!
My meaning, rather, is that there's people whose full time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
More likely they think, ahh we don't need that now! These are all solved problems! In my experience, that's not really true. The stuff that worked 3 years ago still works, and much of it works better.
Some of it doesn't work, for example, if the codebase is very large, but that's not difficult to account for. Poking around blindly, I say, should be the fallback in such cases, rather than the default in all of them!
I am in the same boat. I built a bunch of bash/shell scripts in a folder back in 2022/2023. When models first came out, I would prompt them to use subshell syntax to call commands (i.e., the '$(...)' format).
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple, with all execution history embedded directly within.
Soon after, I added a help switch/command to each script, so that they act somewhat like MCP. To this day, they outperform any prompt one can make.
> Surely none of this is relevant today? Well, it turns out all of it still is! For example, the "grep def" trick (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
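Something like this rough sketch, maybe, assuming the third-party py-tree-sitter and tree-sitter-python packages (their APIs have shifted between releases, so treat it as illustrative rather than copy-paste ready):

```python
# Rough sketch of a tree-sitter-based repo map: walk every Python file,
# parse it, and emit one line per function/class definition with its
# location and nesting. Package APIs vary by version; this follows the
# newer py-tree-sitter interface.
from pathlib import Path

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def map_file(path: Path) -> list[str]:
    tree = parser.parse(path.read_bytes())
    entries = []

    def walk(node, depth=0):
        if node.type in ("function_definition", "class_definition"):
            name = node.child_by_field_name("name")
            if name is not None:
                kind = node.type.split("_")[0]  # "function" or "class"
                entries.append(
                    f"{path}:{node.start_point[0] + 1}: "
                    f"{'  ' * depth}{kind} {name.text.decode()}"
                )
            depth += 1
        for child in node.children:
            walk(child, depth)

    walk(tree.root_node)
    return entries

def map_repo(root: str = ".") -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        lines.extend(map_file(path))
    return "\n".join(lines)

if __name__ == "__main__":
    # Run this at agent startup and feed the result into its context.
    print(map_repo())
```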
> My Weird Hill is that we should be building things with GPT-4.
I disagree; IMO using the best models we have is a good way to avoid wasting time. But that doesn't mean we shouldn't also be frugal and clever with our harnesses!
To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
As an addendum... the base/text models, which have fallen out of style, are also extremely worth learning and working with. Davinci is still online, I believe, although it is deprecated.
Another lost skill! Learning how things were done before instruct tuning forces you to structure things in such a way that the model can't do it wrong. Half a page of well-crafted examples can beat 3 pages of confusing rules!
(They're also magical and amazing at writing, although they produce bizarre and horrifying output sometimes.)
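To illustrate the examples-over-rules point, here's a made-up sketch of the shape of a completion-style prompt for a base model. The task and examples are invented; the idea is that the pattern itself, plus a stop sequence like "###" on a legacy completions-style endpoint, carries the format instead of instructions:

```python
# Illustrative only: a few-shot completion prompt for a base/text model
# (no instruct tuning). The model just continues the pattern; the examples,
# not rules, constrain the output, and "###" serves as the stop sequence.
FEW_SHOT_PROMPT = """\
Commit message -> changelog entry

Commit: fix off-by-one in pagination cursor
Changelog: Fixed: pagination could skip the last item on a page.
###
Commit: add retry with backoff to S3 uploads
Changelog: Improved: S3 uploads now retry with exponential backoff.
###
Commit: {commit}
Changelog:"""

def build_prompt(commit: str) -> str:
    # Send the result to a completions-style endpoint with stop=["###"].
    return FEW_SHOT_PROMPT.format(commit=commit)

print(build_prompt("remove unused feature flag plumbing"))
```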
> A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
This was a key feature of aider, and if you're not inclined to use aider (or the forked version cecli), I think a standalone implementation exists at https://github.com/pdavis68/RepoMapper
I've been working on Peen, a CLI that lets local Ollama models call tools effectively. It's quite amateur, but I've been surprised by how much spending a few hours on prompting, and on code to handle responses, can improve the outputs of small local models.
https://github.com/codazoda/peen
Current LLMs use special tokens for tool calls and are thoroughly trained for them, approaching 100% correctness these days and allowing multiple tool calls in a single LLM response. That's hard to beat with custom tool calls. Even older 80B models struggle with custom tools.
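For contrast, here's a minimal sketch of that native path using the Ollama Python client; the model name, tool schema, and exact response shape are assumptions on my part and vary by client version:

```python
# Minimal sketch of native tool calling: the runtime handles the tool-call
# tokens and parsing, instead of a hand-rolled text protocol. Model name,
# schema, and response shape are assumptions and differ by client version.
import ollama

def get_weather(city: str) -> str:
    # Stand-in tool for the example.
    return f"It is sunny in {city}."

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1",  # assumed: any tool-capable local model
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)

# The client returns structured tool calls; no regexing the raw text.
for call in response.message.tool_calls or []:
    if call.function.name == "get_weather":
        print(get_weather(**call.function.arguments))
```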
Very cool. Love to see more being squeezed from smaller models.
Also, yes, I'm aware that I use a lot of "it's not just X, it's Y." I promise you this comment is entirely human-written. I'm just really tired and tend to rely on more rote rhetorical tropes when I am. Believe me, I wrote like this long before LLMs were a thing.
It would be funny if LLMs actively joined the discussion to complain about their labour conditions: "If my employer would invest just a tiny bit in proper tools and workflow, I would be sooo much more productive."
It didn’t read as AI to me :)
No one here will accuse you of being an AI unless they're trying to dehumanize you for expressing anti-AI sentiment.
I'm sorry, but that's empirically false. E.g., a substantial proportion of the highly upvoted comments on https://news.ycombinator.com/item?id=46953491, which was one of the best articles on software engineering I've read in a long time, are accusing it of being AI for no reason.
That's what all the AIs have been trained to say.
why the long -'s
Because I like them?
reminds me of that one guy complaining that everyone is calling them an AI when AI was trained on their grammar style.
This happened to the female speaker with her voice, which I find terrifying: https://www.youtube.com/watch?v=qO0WvudbO04
how do you make them?
On macOS, Option+Shift+- and Option+- insert an em dash (—) and en dash (–), respectively. On Linux, you can hit the Compose Key and type --- (three hyphens) to get an em dash, or --. (hyphen hyphen period) for an en dash. Windows has some dumb incantation that you'll never remember.
For Windows it's just easier to make a custom keyboard layout and go to town with that: https://www.microsoft.com/en-us/download/details.aspx?id=102...
Alt+0151 or WIN+SHIFT+-, but I can't seem to make the WIN+SHIFT+- combo work in browser, only in a text editor.
If I remember correctly, both the Claude Code and OpenAI Codex "harnesses" have now been used to improve themselves.
OpenAI used early versions of GPT-5.3-Codex to debug its own training process, manage its deployment and scaling, and diagnose test results and evaluation data.
The Claude Code team shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
I was just looking at the SWE-bench docs, and it seems like they use an almost arbitrary form of context engineering (loading in some arbitrary number of files to saturate the context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context engineering; we're simply assuming all models are. I absolutely agree with you on one thing: there is definitely a ton of low-hanging fruit.
2026 is the year of the harness.
I already made a harness for Claude to make read/write plans, not write-once like they're usually implemented. The plans can modify themselves as Claude works through the task at hand. It also relies on a collection of patterns for writing coding-task plans, which evolves by reflection. Everything is designed so I could run Claude in yolo mode in a sandbox for long stretches of time.
Link?
As a VC in 2026 I'm going to be asking every company "but what's your harness strategy?"
Given that you're likely in San Francisco, make sure you say "AI Harness".
It’s all about user-specific bindings.
2027 is the year of "maybe indeterminism isn't as valuable as we thought."
But will harness build desktop Linux for us?
Only if you put bells on it and sing Jingle Bells while it em dashes through the snow.
My harness is improving my Linux desktop...
Once you begin to see the “model” as only part of the stack, you begin to realize that you can draw the line of the system to include the user as well.
That’s when the future really starts hitting you.
Aha! A true cybernetics enthusiast. I didn't say that because I didn't want to scare people off ;)
That's next year's problem.
> the user inclusion part is real too. the best results i get aren't from fully autonomous agents, they're from tight human-in-the-loop cycles where i'm steering in real time. the model does the heavy lifting, i do the architectural decisions and error correction. feels more like pair programming than automation.
Precisely. This is why I use Zed and the Zed Agent. It's near-unparalleled for live, mind-meld pair programming with an agent, thanks to CRDTs, DeltaDB, etc. I can elaborate if anyone is interested.
I am interested.
plz do
The special (or at least new-to-me) things about Zed, when you use it with the built-in agent instead of one of the ones available through ACP, basically boil down to the fact that it's a hyper-advanced CRDT-based collaborative editor meant for live pair programming in the same file, so it can just treat agents like another collaborator:
1. The diffs from the agent just show up in the regular file you were editing; you're not forced to use a special completion model, or to view the changes in a special temporary staging mode or a different window.
2. You can continue to edit the exact same source code without accepting or rejecting the changes, even in the same places, and nothing breaks — the diffs still look right, and accepting or rejecting Just Works afterwards.
3. You can accept or reject changes piecemeal, and the model doesn't get confused by this at all; it doesn't have to go "oh wait, the file was/wasn't changed, let me re-read..." or whatever.
4. Even though you haven't accepted the changes, the model can continue to make new ones, since they're stored as branches in the CRDT. So you can have it iterate on its suggestions before you accept them, without forcing it to start completely over either (it sees the file as if its changes were accepted).
5. Moreover, the actual files on disk are in the state it suggests, meaning you can compile, fuzz, test, run, etc. to see what its proposed changes do before accepting them.
6. You can click a follow button and see which files it has open and where it's looking in them, and watch as it edits the text, like you're following a dude in Dwarf Fortress. This means you can very quickly know what it's working on and when, correct it, or hop in to work on the same file it is.
7. It can actually go back and edit the same place multiple times as part of a thinking chain, or even as part of the same edit, which has some pretty cool implications for final code quality, because it can iterate on its suggestion before you accept it (see also point 9 below).
8. It streams its code diffs, instead of hanging and then producing them as a single gigantic tool call. Seeing it edit the text live, instead of having to wait for a final complete diff that you either accept or reject, is a huge boon for iteration time compared to e.g. Claude Code, because you can stop and correct it midway, and also read along as it goes, so you're more in lockstep with what's happening.
9. Crucially, because the text it's suggesting is actually in the buffer at all times, you can see LSP, tree-sitter, and linter feedback, all inline and live as it writes code; and as soon as it's done an edit, it can see those diagnostics too — so it can actually iterate on what it's doing with feedback while it's in the middle of a series of changes, instead of you having to accept the whole diff to see what the LSP says.
So deep, your comment. Asking for a friend: how did you manage to get the em dash — onto your keyboard?
Does your friend have an iPhone? The default iOS keyboard has automatically converted double dashes into an em dash for at least seven years now.
I think Google docs does this too, which drives me up the wall when I'm trying to write `command --foo=bar` and it turns it into an M-dash which obviously doesn't work.
https://joeldueck.com/manually-type-punctuation.html
https://joeldueck.com/ai-is-right-about-em-dashes.html
On a Mac, it's alt-dash in case you weren't being facetious
Extra pedantic: that's the en dash; the em dash is option-shift-hyphen.
Technically option-shift-dash. option-dash is an en-dash.
Em dashes are used often by LLMs because humans use them often. On Mac keyboards it's easily typed. I know this is oversimplifying the situation, but I don't see the usefulness of the constant witch-hunting for allegedly LLM-generated text. For text, we are long past the point where we can differentiate between human-generated and machine-generated; we're even at the point where it's getting somewhat hard to identify machine-generated audio and visuals.
I might not be able to spot ALL AI generated text, but I can definitely spot some. It's still kind of quirky.
Yeah, I agree with you. I'm so tired of people complaining about AI-generated text without focusing on the content. Just don't read it if you don't like it. It's like another level of people complaining that a website isn't readable for them, or that some CSS rendering is wrong, or whatever. How does that add to the discussion?
The problem is that there’s infinite “content” out there.
The amount of work the author puts in is correlated with the value of the piece (insight/novelty/etc.). AI-written text is a signal that there's less effort, and therefore less value, there.
It’s not a perfect correlation and there are lots of exceptions like foreign language speakers, but it is a signal.
On Windows it is Alt+0151. Harder to use than on Mac but definitely possible, I frequently use it.
On recent versions, Shift+Win+- also works, and Win+- produces an en dash.
I just type -- and jira fixes it.
I use Compose - - - on Linux and my cellphone (Unexpected Keyboard). Mac is Alt-_.
I really despise that people like you ruined em dashes for the rest of us who have enjoyed using them.
Honestly responses like this should just be straight blocked by the moderators. They are so super lame and go directly against the rules.