Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc., so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
There are higher and lower leverage ways to do that, for instance reviewing tests and QA'ing the software by using it rather than reading the original code, but you can't get away from doing it entirely.
I agree with this almost completely. The hard part isn’t generation anymore, it’s validation of intent vs outcome. Especially once decisions are high-stakes or irreversible: think package updates or large-scale transactions.
What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.
Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.
Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
“Anymore?” After 40 years in software I’ll say that validation of intent vs. outcome has always been a hard problem. There are and have been no shortcuts other than determined human effort.
I don’t disagree. After decades, it’s still hard, which is exactly why I think treating validation as a system problem matters.
We’ve spent years systematizing generation, testing, and deployment. Validation largely hasn’t changed, even as the surface area has exploded. My interest is in making that human effort composable and inspectable, not pretending it can be eliminated.
But, is that different from how we already work with humans? Typically we don't let people commit whatever code they want just because they're human. It's more than just code reviews. We have design reviews, sometimes people pair program, there are unit tests and end-to-end tests and all kinds of tests, then code review, continuous integration, QA. We have systems to watch prod for errors or user complaints or cost/performance problems. We have this whole toolkit of process and techniques to try to get reliable programs out of what you must admit are unreliable programmers.
The question isn't whether agentic coders are perfect. Actually it isn't even whether they're better than humans. It's whether they're a net positive contribution. If you turn them loose in that kind of system, surrounded by checks and balances, does the system tend to accumulate bugs or remove them? Does it converge on high or low quality?
I think the answer as of Opus 4.5 or so is that they're a slight net positive and it converges on quality. You can set up the system and kind of supervise from a distance and they keep things under control. They tend to do the right thing. I think that's what they're saying in this article.
This is what we're working on at Speedscale. Our methods use traffic capture and replay to validate that what worked before still works today.
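Not Speedscale's actual tooling, just a minimal sketch of the general record-and-replay idea: capture request/response pairs from known-good traffic, replay them against the new build, and flag drift. The JSONL capture format, field names, and `replay_captured_traffic` helper here are all hypothetical.

```python
import json
import requests  # any HTTP client works; requests is assumed for brevity

def replay_captured_traffic(capture_path: str, base_url: str) -> list[dict]:
    """Replay recorded requests against a new build and collect any
    responses that drift from the recorded baseline."""
    mismatches = []
    with open(capture_path) as f:
        for line in f:
            # Each record: {"method", "path", "body", "expected_status", "expected_body"}
            rec = json.loads(line)
            resp = requests.request(rec["method"], base_url + rec["path"], json=rec.get("body"))
            if resp.status_code != rec["expected_status"] or resp.json() != rec["expected_body"]:
                mismatches.append({"path": rec["path"], "got_status": resp.status_code})
    return mismatches

# Usage: fail the pipeline if behavior drifted from the captured baseline.
# assert not replay_captured_traffic("capture.jsonl", "http://localhost:8080")
```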
It's simple: you just offload the validation and security testing to the end user.
This obviously depends on what you are trying to achieve but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I have suspicions we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).
And “define the spec concretely“ (and how to exploit emerging behaviors) becomes the new definition of what programming is.
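Short of a full proof language, the adjacent idea is already cheap to use today: state the spec as machine-checkable properties and hammer the implementation against them. A minimal sketch using the hypothesis property-testing library; the `dedupe` function and its properties are a made-up example, not anything from the thread.

```python
# Spec, stated as properties the implementation must satisfy, checked
# automatically over many generated inputs.
from hypothesis import given, strategies as st

def dedupe(xs: list[int]) -> list[int]:
    """Implementation under test: drop duplicates, keep first occurrences."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

@given(st.lists(st.integers()))
def test_dedupe_matches_spec(xs):
    ys = dedupe(xs)
    assert len(ys) == len(set(ys))              # spec: no duplicates in the output
    assert set(ys) == set(xs)                   # spec: same elements as the input
    assert ys == sorted(ys, key=xs.index)       # spec: order of first occurrences kept
```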
> “define the spec concretely“
(and unambiguously. and completely. For various depths of those)
This always has been the crux of programming. Just has been drowned in closer-to-the-machine more-deterministic verbosities, be it assembly, C, prolog, js, python, html, what-have-you
There have been never-ending attempts to reduce that to more away-from-machine representations. Low-code/no-code (anyone remember Last-one for Apple ][ ?), interpreting-and/or-generating-off DSLs of various levels of abstraction, further to esperanto-like artificial reduced-ambiguity languages... some even english-like..
For some domains, above worked/works - and the (business)-analysts became new programmers. Some companies have such internal languages. For most others, not really. And not that long ago, the SW-Engineer job was called Analyst-programmer.
But still, the frontier is there to cross..
Code is always the final spec. Maybe the "no engineers/coders/programmers" dream will come true, but in the end, the soft, wish-like, very undetailed business "spec" has to be transformed into a hard implementation that covers all (well, most of the) corners. Maybe when context size reaches 1G tokens and memory isn't wiped every new session? Maybe after two or three breakthrough papers? For now, the frontier isn't reached.
The thing is, it doesn’t matter how large the context gets: for a spec to cover all implementation details, it has to be at least as complex as the code.
That can’t ever change.
And if the spec is as complex as the code, it’s not meaningfully easier to work with the spec vs the code.
AI also quickly goes off the rails, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests. It wouldn't pass skilled human review. Worst thing is that if you let it, it will just grow tech debt on top of tech debt.
The code itself does not matter. If the tests pass, and the tests are good, then who cares? AI will be maintaining the code.
Next iterations of models will have to deal with that code, and it will get harder and harder to fix bugs and introduce features without triggering or introducing more defects.
Biological evolution overcomes this by running thousands and millions of variations in parallel and letting the more defective ones crash and die. In software ecosystems, we can't afford such a luxury.
Tests don't cover everything. Performance? Edge cases? Optimization of resource usage isn't typically covered by tests.
Humans not caring about performance is so common we have Wirth's law
But now that the clankers are coming for our jobs, suddenly we're optimization specialists
It’s not about optimizing for performance, it’s about non-deterministic performance between “compiler” runs.
The ideal that spec driven developers are pushing towards is that you’d check in the spec not the code. Anytime you need the code you’d just regenerate it. The problem is different models, different runs of the same model, and slightly different specs will produce radically different code.
It’s one thing when your program is slow, it’s something completely different when your program performance varies wildly between deployments.
This problem isn’t limited to performance, it’s every implicit implementation detail not captured in the spec. And it’s impossible to capture every implementation detail in the spec without the spec being as complex as the code.
I made a very similar comment to this just today: https://news.ycombinator.com/item?id=46925036
I agree, and I didn't even fully consider "recompiling" would change important implementation details. Oh god
This seems like an impossible problem to solve? Either we specify every little detail, or AI reads our minds
An example: it had a complete interface to a hash map. The task was to delete elements. Instead of using the hash map API, it iterated through the entire underlying array to remove a single entry. The expected solution was O(1), but it implemented O(n). These decisions compound. The software may technically work, but the user experience suffers.
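A hypothetical reconstruction of that pattern (not the original codebase): the structure already supports O(1) deletion by key, but the generated code scans every entry anyway. Both versions pass any correctness test you'd plausibly write, which is why the regression is invisible without a performance budget or someone reading the code.

```python
cache: dict[str, bytes] = {}  # stand-in for the hash map in question

def evict_scan(key: str) -> None:
    """What the model wrote: walk every entry to find one key, O(n)."""
    for k in list(cache.keys()):
        if k == key:
            del cache[k]
            return

def evict_direct(key: str) -> None:
    """What the existing API already offers: direct O(1) removal."""
    cache.pop(key, None)
```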
If you have particular performance requirements like that, then include them. Test for them. You still don’t have to actually look at the code. Either the software meets expectations or it doesn’t, and keep having AI work at it until you’re satisfied.
How deep do you want to go? Because a reasonable person wouldn't expect to have to hand-hold an AI(ntelligence) at that level. Of course, after I pointed it out, it corrected itself. But that involved looking at the code and knowing the code was poor. If you don't look at the code, how would you know to state this requirement? Somehow you have to assess the level of intelligence you are dealing with.
Since the code does not matter, you wouldn’t need or want to phrase it in terms of algorithmic complexity. You surely would have a more real world requirement, like, if the data set has X elements then it should be processed within Y milliseconds. The AI is free to implement that however it likes.
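A requirement of that shape is straightforward to encode as a test. The 100,000-entry map, 1,000 deletions, and 50 ms figure below are made-up stand-ins, not numbers from the thread; the point is only the shape of the check.

```python
import time

def test_bulk_delete_meets_latency_budget():
    # Hypothetical requirement: removing 1,000 keys from a 100,000-entry
    # map finishes within 50 ms. The implementation is free to do this
    # however it likes, as long as the budget holds.
    data = {f"key-{i}": i for i in range(100_000)}
    doomed = [f"key-{i}" for i in range(1_000)]

    start = time.perf_counter()
    for k in doomed:
        data.pop(k, None)
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert elapsed_ms < 50, f"took {elapsed_ms:.1f} ms against a 50 ms budget"
```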
Even if you specify performance ranges for every individual operation, you can’t specify all possible interactions between operations.
If you don’t care about the code you’re not checking in the code, and every time you regenerate the code you’re going to get radically different system performance.
Say you have 2 operations that access some data and you specify that each can’t take more than 1ms. Independently they work fine, but when a user runs B then A immediately, there’s some cache thrashing that happens that causes them to both time out. But this only happens in some builds because sometimes your LLM uses a different algorithm.
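An illustrative sketch of why per-operation budgets don't compose: the sequences users actually run need their own budgets, and those multiply. The operations and numbers here are invented; a toy like this won't reproduce real cache thrashing, it only shows the shape of the extra tests the spec would have to carry.

```python
import itertools
import time

def measure_ms(fn) -> float:
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000

# Two invented operations, each comfortably inside a 50 ms budget on its own.
ops = {
    "a": lambda: sum(range(500_000)),
    "b": lambda: sum(range(500_000)),
}

# The part of the spec people actually write down:
assert all(measure_ms(fn) < 50 for fn in ops.values())

# The part that explodes: every sequence a user might run needs its own
# budget, and a regenerated implementation can break a sequence that passed
# last build even though each step still passes individually.
for first, second in itertools.permutations(ops.values(), 2):
    assert measure_ms(lambda: (first(), second())) < 100
```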
This kind of thing can happen with normal human software development of course, but constantly shifting implementations that “no one cares about” are going to make stuff like this happen much more often.
There’s already plenty of non determinism and chaos in software, adding an extra layer of it is going to be a nightmare.
The same thing is true for every single implementation detail that isn’t in the spec. In a complex system even implementation details you don’t think you care about become important when they are constantly shifting.
That's assuming no human ever goes near the code, that it doesn't get out of hand over time (inference time and token limits are all a thing), and that anti-patterns don't accumulate to the point where the code is a logical mess that produces bugs through a web of specific behaviors instead of proper architecture.
However I guess that at least some of that can be mitigated by distilling out a system description and then running agents again to refactor the entire thing.
And that is the right assumption. Why would any humans need (or even want) to look at code any more? That’s like saying you want to go manually inspect the oil refinery every time you fill your car up with gas. Absurd.
> However I guess that at least some of that can be mitigated by distilling out a system description and then running agents again to refactor the entire thing.
The problem with this is that the code is the spec. There are 1000 times more decisions made in the implementation details than are ever going to be recorded in a test suite or a spec.
The only way for that to work differently is if the spec is as complex as the code and at that level what’s the point.
With what you’re describing, every time you regenerate the whole thing you’re going to get different behavior, which is just madness.
did you read the article?
>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
Tests are only rigorous if the correct intent is encoded in them. Perfectly working software can be wrong if the intent was inferred incorrectly. I leverage BDD heavily, and there are a lot of little details it's possible to misinterpret going from spec -> code. If the spec was sufficient to fully specify the program, it would be the program, so there's lots of room for error in the transformation.
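A hypothetical example of the kind of little detail that gets lost in that transformation: suppose the spec line is "discounts apply to orders over $100". Two readings both look right, and both pass a happy-path test.

```python
def discount_exclusive(total: float) -> float:
    """Reading 1: strictly over $100 qualifies."""
    return total * 0.9 if total > 100 else total

def discount_inclusive(total: float) -> float:
    """Reading 2: $100 and up qualifies."""
    return total * 0.9 if total >= 100 else total

# The readings agree everywhere a casual test would look...
assert discount_exclusive(150.0) == discount_inclusive(150.0)
assert discount_exclusive(50.0) == discount_inclusive(50.0)
# ...and disagree exactly at the boundary the spec never pinned down.
assert discount_exclusive(100.0) != discount_inclusive(100.0)
```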
Then I disagree with you
> You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.
You don't need a human who knows the system to validate it if you trust the LLM to do the scenario testing correctly. And in my experience, it is very trustworthy in this regard.
Can you describe a scenario in which an LLM would get the scenario testing wrong?
I do not trust the LLM to do it correctly. We do not have the same experience with them, and should not assume everyone does. To me, your question makes no sense to ask.
We should be able to measure this. I think verifying things is something an LLM can do better than a human.
You and I disagree on this specific point.
Edit: I find your comment a bit distasteful. If you can provide a scenario where it can get it incorrect, that's a good discussion point. I don't see many places where LLMs can't verify as well as humans. If I developed new business logic like "users from country X should not be able to use this feature", an LLM can very easily verify this by generating its own sample API call and checking the response.
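A rough sketch of the shape of that check; the endpoint path, country header, and expected status code below are hypothetical, not a real API.

```python
import requests

def verify_country_block(base_url: str, blocked_country: str = "X") -> bool:
    """Exercise the feature as a user from the blocked country and
    confirm the API refuses it."""
    resp = requests.post(
        f"{base_url}/api/feature",                    # hypothetical endpoint
        json={"action": "use_feature"},
        headers={"X-User-Country": blocked_country},  # hypothetical geo header
    )
    return resp.status_code == 403

# e.g. assert verify_country_block("http://localhost:8080")
```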
The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language. Having it write tests doesn't change this, it's only asserting that its view of what you want is internally consistent, it is still just as likely to be an incorrect interpretation of your intent.
Have you worked in software long? I've been in eng for almost 30 years, started in EE. Can confidently say you can't trust the humans either. SWEs have been wrong over and over. No reason to listen now.
Just a few years ago, SWEs said code-gen LLMs were impossible. In the 00s, SWEs were certain no business would trust its data to the cloud.
OS and browsers are bloated messes, insecure to the core. Web apps are similarly just giant string mangling disasters.
SWEs have memorized endless amount of nonsense about their role to keep their jobs. You all have tons to say about software but little idea what's salient and just memorized nonsense parroted on the job all the time.
Most SWEs are engaged in labor role-play, there to earn nation state scrip for food/shelter.
I look forward to the end of the most inane era of human "engineering" ever.
Everything in software can be whittled down to geometry generation and presentation, even text. End users can label outputs mechanical-turk style and apply whatever syntax they want, while the machine itself handles arithmetic and Boolean logic against memory and syncs output to the display.
All the linguistic gibberish in the typical software stack will be compressed[1] away, all the SWE middlemen unemployed.
Rotary phone assembly workers have a support group for you all.
[1] https://arxiv.org/abs/2309.10668
> The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
Then it seems like the only workable solution from your perspective is a solo member team working on a product they came up with. Because as soon as there's more than one person on something, they have to use "lossy natural language" to communicate it between themselves.
Coworkers are absolutely an ongoing point of friction everywhere :)
On the plus side, IMO nonverbal cues make it way easier to tell when a human doesn't understand things than an agent.
>> The whole point is that you can't 100% trust the LLM to infer your intent with accuracy from lossy natural language.
You can't 100% trust a human either.
But, as with self-driving, the LLM simply needs to be better. It does not need to be perfect.
> You can't 100% trust a human either.
We do have a system of checks and balances that does a reasonable job of it. Not everyone in a position of power is willing to burn their reputation and land in jail. You don't check the food at the restaurant for poison, nor check whether the gas in your tank is OK. But you would if the cook or the gas manufacturer were as reliable as current LLMs.
Good analogy
> If the spec was sufficient to fully specify the program, it would be the program
Very salient concept with regard to LLMs and the idea that one can encode a program one wishes to see as output in natural English-language input. There's lots of room for error in all of these LLM transformations for the same reason.