Translating from a natural language spec to code involves a truly massive amount of decision making.
For a non-trivial program, two implementations of the same natural language spec will have thousands of observable differences.
Where we are today (that is, agents still require guardrails to keep from spinning out), there is no way to let agents work on code autonomously without all of those observable differences constantly shifting, resulting in unusable software.
Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.
The only solution to this problem is for LLMs to get better. Personally, I think that at the point they can pull this off, they can do any white-collar job, and there's no point in planning for that future because it results in either Mad Max or Star Trek.
> Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.
I don't think "complex" is the right word here. A test suite would generally be more verbose than the implementation, but a lot of the time it can simply be a long list of input->output pairs that are individually very comprehensible and easily reviewable to a human. The hard part is usually discovering what isn't covered by the test case, rather than validating the correctness of the test cases you do have.
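To make the point concrete, here is a minimal sketch of what such a table-driven suite looks like (the `slugify` function and its cases are hypothetical, purely for illustration): each row is an input→output pair that a human can review in isolation.

```python
# A hypothetical table-driven test: each case is an input -> expected-output
# pair that is individually easy to review, even if the list grows long.
def slugify(title: str) -> str:
    # Toy implementation under test (illustrative only).
    return "-".join(title.lower().split())

CASES = [
    ("Hello World", "hello-world"),
    ("  leading spaces", "leading-spaces"),
    ("Already-Slugged", "already-slugged"),
]

for given, expected in CASES:
    assert slugify(given) == expected, (given, expected)
print("all cases pass")
```

The hard part, as noted above, is not checking any one row but knowing which rows are missing.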
Code is like f(x) = ax + b. Your test would be a list of (x, y) tuples. You don't verify the correctness of your points because they come from some source that you hold as true. What you want is the generic solution (the theory) proposed by the formula, and your tests would be just a small set of points, mostly to ensure that no one has changed the a and b parameters. But if you have only a finite number of points, the AI is more likely to give you a complicated spline formula than the simple formula above, unless the tokens in the prompts push it to the right domain space (which usually means the problem is solved already).
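A quick sketch of that underdetermination (the numbers here are made up for illustration): finitely many (x, y) points sampled from f(x) = 2x + 1 are equally well satisfied by a more complicated curve, and the two only disagree off the test set.

```python
# Sketch of the underdetermination point: a finite set of (x, y) test points
# from f(x) = 2x + 1 is also satisfied by infinitely many other functions.
TESTS = [(0, 1), (1, 3)]  # "truth" sampled from f(x) = 2x + 1

def f(x):
    # The simple theory.
    return 2 * x + 1

def g(x):
    # A "spline-like" curve constructed to pass through the same points.
    return 2 * x + 1 + 5 * x * (x - 1)

assert all(f(x) == y for x, y in TESTS)  # both implementations pass every test...
assert all(g(x) == y for x, y in TESTS)
print(f(2), g(2))  # ...but diverge at an untested input: 5 vs 15
```

Adding more points shrinks the space of curves that fit, but it never shrinks it to one.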
Real code has more dimensionality than the above example. Experts have the right keywords, but even then it's a roll of the dice. And coming up with enough sample test cases is more arduous than writing the implementation.
Unless there's no real solution (the dimensionality is too high), but we have a lot of test data with a lower dimensionality than the problem. This used to be called machine learning, and we have metrics like accuracy for it.
If some of those input-output pairs are the result of a different interpretation of the spec to other input-output pairs, it's possible that no program satisfies all the tests (or, worse, that a program that satisfies all the tests isn't correct).
At some point verbosity becomes complexity. If you're talking about all observable behavior, the input and output pairs are likely to be quite verbose/complex.
Imagine testing a game where the inputs are the possible states of the game plus the possible control inputs, and the outputs are the states that could result.
Of course, very few human-written programs require this level of testing, but if you are trying to prevent a swarm of agents from changing observable behavior without human review, that's what you'd need.
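A back-of-the-envelope sketch of why that's infeasible (the game parameters below are invented for illustration): even a toy game's exhaustive (state, input) → state table explodes combinatorially.

```python
# Hypothetical toy game: a 10x10 board, 8 boolean item pickups, 5 controls.
# An exhaustive behavioral test table needs one row per (state, input) pair.
GRID_CELLS = 10 * 10   # player position
ITEM_FLAGS = 2 ** 8    # 8 independent boolean pickups
CONTROLS = 5           # up / down / left / right / idle

states = GRID_CELLS * ITEM_FLAGS
cases = states * CONTROLS
print(cases)  # 128000 rows for this toy; real games add many orders of magnitude
```

And this toy ignores velocity, animation frames, inventory, and every other axis a real game state has.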
Even with simpler input/output pairs, suppose an AI tells you it added a feature and had to change 2,000 input/output pairs to do so. How do you verify that those changes were necessary, and how do you verify that you actually have enough cases to prevent the AI from doing something dumb?
Oops, you didn't have a test that said items shouldn't turn completely transparent when you drag them.
> For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.
If they're not defined in the spec, then these differences shouldn't matter; they're just implementation details. And if they do matter, then they should be included in the spec; a natural language spec that doesn't specify things that should be specified is not a good spec.
> we just need to make the spec perfect
So, never.
Greg Kroah-Hartman was once asked by his boss, "when will Linux be done?" and he said, "when people stop making new hardware." Even today, when we assume the hardware won't lie, much of the work in maintaining Linux is around hardware bugs.
So even at the lowest levels of software development, you can’t know the bugs you’re going to have until you partially solve the problem and find out that this combination of hardware and drivers produces an error, and you only find that out because someone with that combination tried it. There is no way to prevent that by “make better spec”.
But that's always been true. Basically it's the three-body problem: on the spectrum of simple, complicated, and complex, you can calculate the future state of a system if it's simple, or sometimes if it's "only complicated," but you literally cannot know the future state of a complex system without simulating it, running each step and finding out.
And it gets worse. Software itself ranges from simple to complicated to complex, but it also exists within a complex hardware environment, and within a complex business environment where people change, interest rates change, and motives change from month to month.
There is no “correct spec”.
There are a limitless number of implementation details you don't think you care about until they are constantly changing.
I doubt there exists a single piece of nontrivial software today where you could randomly alter 5% of the implementation details while keeping to the spec, without resulting in a flood of support tickets.
Agreed, but with one exception: are tests supposed to cover all observable behavior? Usually people are happy with just eliminating large/easy classes of bad (unintended) behavior, otherwise they go for formal verification which is an entirely different ballgame.
No, they aren't, because they can't (at least not without becoming so complicated that there's no longer a point).
But humans are much better than current LLMs at reasoning about whether a change will impact observable behavior, as evidenced by the fact that LLMs require a test suite or something similar to build a working app longer than a few thousand lines.