Level 4 is where I see the most interesting design decisions get made, and also where most practitioners take a shortcut that compounds badly later.
When the author talks about "codifying" lessons, the instinct for most people is to update the rules file. That works fine for conventions - naming patterns, library preferences, relatively stable stuff. But there's a different category of knowledge that rules files handle poorly: the why behind decisions. Not what approach was chosen, but what was rejected and why the tradeoff landed where it did.
"Never use GraphQL for this service" is a useful rule to have in CLAUDE.md. What's not there: that GraphQL was actually evaluated, got pretty far into prototyping, and was abandoned because the caching layer had been specifically tuned for REST response shapes, and the cost of changing that was higher than the benefit for the team's current scale. The agent follows the rule. It can't tell when the rule is no longer load-bearing.
The place where this reasoning fits most naturally is git history - decisions and rejections captured in commit messages, versioned alongside the code they apply to. Good engineers have always done this informally. The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory.
At level 7, this matters more than people expect. Background agents running across sessions with no human-in-the-loop have nothing to draw on except whatever was written down. A stale rules file in that context doesn't just cause mistakes - it produces confident mistakes.
The "why behind decisions" gap is real. Rules files flatten tradeoffs into mandates. One pattern that helps: treating instructions as typed blocks rather than prose. A `context` block carries the rationale (what was evaluated, what the tradeoffs were), a `constraints` block carries the conclusion. The agent follows rules, but the blocks make it easier to audit which constraints are still load-bearing vs. historical artifacts.
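Concretely, using the GraphQL example above, the separation might look like this (a sketch only; the block names and compiled XML shape are my convention, not a standard):

```xml
<!-- Illustrative: block names are a convention, not a spec -->
<context>
  GraphQL was evaluated and prototyped for this service. It was rejected
  because the caching layer is tuned for REST response shapes, and
  retuning it cost more than the benefit at the team's current scale.
</context>
<constraints>
  Use REST for this service. Do not introduce GraphQL.
</constraints>
```

The point is auditability: when the caching layer is eventually replaced, the `context` block tells you the `constraints` block may no longer be load-bearing.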
I've been building github.com/Nyrok/flompt around this idea, a visual prompt builder that decomposes prompts into 12 semantic blocks and compiles to Claude-optimized XML. The block separation turns out to be useful exactly for this case: context is not constraints, and they shouldn't live in the same flat text blob.
It is for this reason that I usually keep an "adr" folder in my repo to capture Architecture Decision Record documents in markdown. These allow the agent to get the "why" when it needs to. Useful for humans too.
The challenge is really crafting your main agent prompt such that the agent only reads the ADRs when absolutely necessary. Otherwise they muddy the context for simple inside-the-box tasks.
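A minimal version of both pieces, assuming a `docs/adr/` layout (file names and wording are illustrative, not a standard):

```markdown
<!-- docs/adr/0007-rest-over-graphql.md -->
# 7. Use REST, not GraphQL, for this service

Status: accepted

## Context
GraphQL was prototyped and rejected: the caching layer is tuned for
REST response shapes, and retuning it cost more than the benefit at
our current scale.

## Decision
Keep REST. Revisit if the caching layer is replaced.

<!-- CLAUDE.md excerpt: gate ADR reads so they don't pollute context -->
Read docs/adr/ only when a task changes architecture, adds a
dependency, or touches code an ADR covers. Skip for routine tasks.
```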
I had a hunch that this comment was LLM-generated, and the last paragraph confirmed it. Kudos for managing to get so many upvotes though.
"Where most [X] [Y]" is an up-and-coming LLM trope that seems to have surfaced fairly recently. I have no idea why, considering most claims of that form are based on no data whatsoever.

It’s still an insightful and well-written comment, but the LLM-ness does make me wonder whether this part was actually human-intended or just LLM filler:
> The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory
Because I somewhat agree that the discipline may be missing, but I don’t believe it to be a groundbreaking revelation that it’s actually quite easy to tell the LLM to put key reasoning that you give it throughout the conversation into the commits and issues it works on.
Suppose you spend months deeply researching a niche topic. You make your own discoveries, structure your own insights, and feed all of this tightly curated, highly specific context into an LLM. You essentially build a custom knowledge base and train the model on your exact mental framework.
Is this fundamentally different from using a ghostwriter, an editor, or a highly advanced compiler? If I am doing the heavy lifting of context engineering and knowledge discovery, it feels restrictive to say I shouldn't utilize an LLM to structure the final output. Yet, the internet still largely views any AI-generated text as inherently "un-human" or low-effort.
I would ignore any HN content written by a ghost writer or editor. I guess I would flag compiler output but I’m not sure we’re talking about the same thing?
I’m on the internet for human beings. I already read a newspaper for editors and books for ghostwriters.
Not for long though, HN is dying. Just hanging around here waiting for the next thing, I guess…
Sorry man, the internet has died and is not being replaced by anything but an authoritarian nightmare.
My only guess is that if you want actual humans, you'll have to do this IRL. Of course, we as humans have got used to the 24/7 availability and scale of the internet, so this is going to be a problem, as these meetings won't provide the hyperactive environment we want.
Any other digital system will be gamed in one way or another.
The problem is: the structure of LLM outputs generally makes everything sound profound. It’s very hard to tell quickly whether a comment has actual signal or is just well-written bullshit.
And because the cost of generating the comments is so low, there’s no longer an implicit stamp of approval from the author. It used to be the case that you could engage with a comment in good faith, because you knew somebody had spent effort creating it, so they must believe it’s worth the time. Even on a semi-anonymous forum like HN, that used to be a reliable signal.
So a lot of the old heuristics just don’t work on LLM-generated comments, and in my experience 99% of them turn out to be worthless. So the new heuristic is to avoid them and point them out to help others avoid them.
I would much rather just read the prompt.
I hadn't seen this put so eloquently with respect to LLM text output, but you're right. "LLMs make everything sound profound" and "well-written bullshit".
This has severe ramifications for internet communications in general on forums like HN and others, where it seems LLM-written comments are sneaking in pretty much everywhere.
It's also very, very dangerous :/ because the structure of the writing falsely implies authority and trust where it shouldn't, or where it's not applicable.
The bottleneck isn't agent capability. It's captured institutional knowledge.
I structure system prompts (CLAUDE.md) with verification gates: pre-task checkpoints, approach selection, post-completion rescans. When an agent writes an ADR during a refactor, future agents reference it before touching the same code. The context compounds across sessions.
Commit messages capture what. ADRs capture why. Skill files capture how the team works. That last layer is what most setups miss. The gap isn't level 6 vs 8, it's whether architectural reasoning is machine-readable or trapped in someone's head.
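A sketch of what those gates can look like in practice (an excerpt only; the section names and wording are my own, not a prescribed CLAUDE.md format):

```markdown
<!-- CLAUDE.md excerpt: verification gates (illustrative) -->
## Verification gates
- Pre-task: before editing code covered by an ADR in docs/adr/,
  read that ADR and restate its constraint.
- Approach selection: state the chosen approach and at least one
  rejected alternative, with the tradeoff.
- Post-completion: rescan the diff against the constraints above
  and flag any ADR that the change may have invalidated.
```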
I have a skill and template for adding ADRs to the documentation for this purpose.
A good rule would then be to capture such reasoning, at least when made during the session with the agent, in the commit messages the agent creates.
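Under such a rule, a commit produced at the end of a session might look like this (content illustrative, reusing the GraphQL example from upthread):

```text
refactor(cache): keep REST response shapes for order endpoints

Why: GraphQL was evaluated during this session and rejected because
the caching layer is tuned to REST response shapes; retuning it costs
more than the benefit at current scale. Revisit if the cache layer
is replaced.
```

The "Why:" paragraph is the part a future agent (or human) can retrieve with `git log` when deciding whether the constraint still holds.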
That’s exactly the direction I went. I’m working on a spec for this - planning to post it here soon:
https://github.com/berserkdisruptors/contextual-commits