My Weird Hill is that we should be building things with GPT-4.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models pushes you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
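For scale, the whole skeleton fits in something like this. (A minimal sketch against the chat completions API, assuming the OpenAI Python SDK and an API key in the environment; the prompt and the "RUN:" convention are illustrative, not the exact agent I built.)

```python
# Minimal sketch of a GPT-4-era coding agent: no tool calling, no structured
# outputs, just a loop around the chat completions API. The prompt and the
# "RUN:" line convention are illustrative assumptions.
import re
import subprocess
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a coding agent working in the current directory. "
    "To act, reply with exactly one line of the form 'RUN: <shell command>'; "
    "I will execute it and show you the output. "
    "When the task is finished, reply with 'DONE:' and a short summary."
)

def run_agent(task: str, max_steps: int = 20) -> None:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if reply.lstrip().startswith("DONE:"):
            print(reply)
            return
        match = re.search(r"^RUN: (.+)$", reply, re.MULTILINE)
        if not match:
            messages.append({"role": "user",
                             "content": "No RUN: line found, try again."})
            continue
        result = subprocess.run(match.group(1), shell=True,
                                capture_output=True, text=True)
        # Truncate so one noisy command doesn't eat the whole context window.
        messages.append({"role": "user",
                         "content": (result.stdout + result.stderr)[-4000:] or "(no output)"})
```

The loop plus output truncation is basically the entire harness; everything else is the half page of prompt.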
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g., as a small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
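Concretely, the map can be produced as a one-shot startup step like this (a sketch only; the file name and the idea of wiring it into a startup hook or referencing it from CLAUDE.md are my assumptions, not a tested Claude Code recipe):

```python
# Sketch of the "grep def" idea as a startup step: dump every Python
# def/class (with file and line number) into one file the agent can read at
# session start. File name and hook wiring are assumptions.
import subprocess
from pathlib import Path

result = subprocess.run(
    ["grep", "-rnE", "--include=*.py", "^[[:space:]]*(def|class) ", "."],
    capture_output=True, text=True,
)
Path("CODEMAP.txt").write_text(result.stdout)
print(f"wrote {len(result.stdout.splitlines())} definitions to CODEMAP.txt")
```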
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
> My Weird Hill is that we should be building things with GPT-4.
Absolutely. I always advocate that our developers test on older / slower machines; that gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
> semantic
> grep def
Once you get to a codebase beyond a certain size, that no longer works.
For one, I've found Serena (https://github.com/oraios/serena), which you can install right from within Claude, to be a fairly fantastic code-interaction tool for LLMs: both semantic search and editing, with way less token churn.
This is an interesting one - thanks for sharing!
The problem with these exercises is always: I have limited time and capacity to do things, and an effectively unlimited number of problems that I can think of to solve. Coding is not a problem I want to solve. Prompt engineering is not a problem I want to solve.
If I do things for the love of it, the rules are different, of course. But otherwise I simply accept that there are many things improving around me that I have no intimate knowledge of and probably never will; I let other people work them out and happily lean on their work to do the next thing I care about that isn't already solved.
Well it's an amusing exercise I suppose, if you're into that sort of thing. I certainly enjoy it!
My meaning, rather, is that there are people whose full-time job is to build these things who seem to have forgotten what everyone in the field knew 3 years ago.
More likely they think, ahh we don't need that now! These are all solved problems! In my experience, that's not really true. The stuff that worked 3 years ago still works, and much of it works better.
Some of it doesn't work, for example, if the codebase is very large, but that's not difficult to account for. Poking around blindly, I say, should be the fallback in such cases, rather than the default in all of them!
I am in the same boat. I built a bunch of bash/shell scripts in a folder back in 2022/2023. When models first came out, I would prompt them to use subshell syntax to call commands (i.e. the `$(...)` format).
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple, with all execution history embedded directly within.
Soon after, I added a help switch/command to each script, so that they act like MCP. To this day, they outperform any prompts one can make.
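The pattern isn't specific to bash. A hypothetical Python analogue of one of those self-describing scripts might look like this (script name and arguments are invented for illustration):

```python
# Hypothetical self-describing tool script: argparse generates the --help
# text, so the model discovers the interface by running
# "python search_tickets.py --help" instead of having it pasted into the
# prompt. Name and arguments are made up for illustration.
import argparse

parser = argparse.ArgumentParser(
    prog="search_tickets.py",
    description="Search the ticket archive and print matching ticket IDs, one per line.",
)
parser.add_argument("query", help="free-text search string")
parser.add_argument("--limit", type=int, default=10, help="maximum number of results")
args = parser.parse_args()

# ...the actual search would go here; stubbed out for the sketch...
print(f"(would search for {args.query!r}, limit {args.limit})")
```

The agent only needs to know the folder of scripts exists; `--help` does the rest.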
> Well, surely none of this is relevant today? Well, it turns out all of it still is! e.g., as a small fix, the "grep def" (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low-hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
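Something in that direction can be sketched with Python's built-in ast module as a stand-in for tree sitter (tree sitter is what you'd reach for to cover other languages the same way):

```python
# Sketch of a richer repo map, using Python's built-in ast module as a
# stand-in for tree-sitter: one indented outline per file, with signatures,
# suitable for dumping into context when the agent starts up.
import ast
from pathlib import Path

def outline(path: Path) -> str:
    try:
        tree = ast.parse(path.read_text(errors="ignore"))
    except SyntaxError:
        return f"{path} (unparseable)"
    lines = [str(path)]
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return "\n".join(lines)

if __name__ == "__main__":
    print("\n\n".join(outline(p) for p in sorted(Path(".").rglob("*.py"))))
```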
> My Weird Hill is that we should be building things with GPT-4.
I disagree. IMO, using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
To clarify, I didn't mean we should be using ancient models in production, I meant in R&D.
Anthropic says "do the simplest thing that works." If it works with the LLMs we had 3 years ago, doesn't that make it simpler?
The newer LLMs mostly seem to work around the poor system design. (Like spawning 50 subagents on a grep-spree because you forgot to tell it where anything is...) But then you get poor design in prod!
As an addendum... The base/text models, which have fallen out of style, are also extremely worth learning and working with. Davinci is still online, I believe, although it is deprecated.
Another lost skill! Learning how things were done before instruct tuning forces you to structure things in such a way that the model can't do it wrong. Half a page of well-crafted examples can beat 3 pages of confusing rules!
(They're also magical and amazing at writing, although they produce bizarre and horrifying output sometimes.)
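For anyone who never used the completions endpoint, the whole trick is that the prompt is nothing but examples plus one unfinished item, with a stop sequence to end the generation. A rough sketch (examples are invented, and the model name is an assumption about what's still available on the completions endpoint):

```python
# Sketch of few-shot prompting against a base (non-instruct) completion
# model: no rules, just examples plus one unfinished item, and a stop
# sequence so the model doesn't keep going. Model name and examples are
# placeholders, not a recommendation.
from openai import OpenAI

client = OpenAI()

PROMPT = """\
Commit message: fix off-by-one in pagination
Changelog entry: Fixed a bug where the last page of results was cut off.

Commit message: add --json flag to export command
Changelog entry: The export command can now emit JSON via --json.

Commit message: bump lxml to 5.2, drop py3.8 support
Changelog entry:"""

completion = client.completions.create(
    model="davinci-002",   # assumption: a still-available base completion model
    prompt=PROMPT,
    max_tokens=60,
    temperature=0.3,
    stop=["\n\n"],
)
print(completion.choices[0].text.strip())
```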
> A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
This was a key feature of aider, and if you're not inclined to use aider (or the forked version cecli), I think a standalone implementation exists at https://github.com/pdavis68/RepoMapper