Does programming have a clear reward function? A vague description from a business person is not it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?

Exactly, and people have been saying this for a while now. If an "AI software engineer" needs a perfect spec with zero ambiguity, all edge cases defined, full test coverage with desired outcomes etc., then the person writing the spec is the actual software engineer, and the AI is just a compiler.

We’ve also learned that starting off with a rigidly defined spec is actually harmful to most user-facing software, since customers change their minds so often and have a hard time knowing what they want right from the start.

This is why most of the best software is written by people writing things for themselves and most of the worst is made by people making software they don't use themselves.

True facts: half of all self-made software is task trackers.

Sure, and the most performed song in the world is probably Hot Cross Buns or Mary Had a Little Lamb.

Exactly. This is what I tell everyone. The harder you work on the spec, the easier things get afterwards. And this is exactly what businesses with lofty goals don't get, or ignore. Put another way: a fool with a tool…

Also look out for optimization done the "clever" way.

This is not quite right - a specification is not equivalent to writing software, and the code generator is not just a compiler - in fact, generating implementations from specifications is a pretty active area of research (a simpler problem is the problem of generating a configuration that satisfies some specification, "configuration synthesis").

In general, implementations can be vastly more complicated than even a complicated spec (e.g. by having to deal with real-world network failures, etc.), whereas a spec needs only to describe the expected behavior.

In this context, this is actually super useful, since defining the problem (writing a spec) is usually easier than solving the problem (writing an implementation); it's not just translating (compiling), and the engineer is now thinking at a higher level of abstraction (what do I want it to do vs. how do I do it).

Surely a well-written spec would include non-functional requirements like resilience and performance?

However, I agree that's the hard part. I can write a spec for finding the optimal solution to some combinatorial problem, where the naive code is trivial (a simple recursive function, for example), but such a function would use near-infinite time and memory.
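For instance, a minimal sketch in Python of a toy knapsack-style problem (the numbers are arbitrary): the brute-force search below is essentially the spec written as code, obviously correct and hopelessly exponential.

```python
from itertools import combinations

def best_subset(values, weights, capacity):
    # Naive "spec as code": enumerate every subset and keep the best feasible one.
    # Trivially correct, but exponential in len(values).
    best, best_value = (), 0
    for r in range(len(values) + 1):
        for subset in combinations(range(len(values)), r):
            w = sum(weights[i] for i in subset)
            v = sum(values[i] for i in subset)
            if w <= capacity and v > best_value:
                best, best_value = subset, v
    return best, best_value

# best_subset([10, 7, 4], [5, 4, 3], capacity=7) -> ((1, 2), 11)
```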

In terms of the ML programme really being a compiler - isn't that in the end true - the ML model is a computer programme taking a spec as input and generating code as output. Sounds like a compiler to me.

I think the point of the AK post is to say the challenge is in the judging of solutions - not the bit in the middle.

So to take the writing software problem - if we had already sorted the computer programme validation problem there wouldn't be any bugs right now - irrespective of how the code was generated.

The point was specifically that that obvious intuition is wrong, or at best incomplete and simplistic.

You haven't disproved this idea, merely re-stated the default obvious intuition that everyone is expected to have before being presented with this idea.

Their point is correct that defining a spec rigorously enough IS the actual engineering work.

A C or Go program is nothing but a spec which the compiler implements.

There are infinite ways to implement a given C expression in assembly, and doing that is engineering and requires a human to do it, but only once. The compiler doesn't invent how to do it every time the way a human would; the compiler author picked a way and now the compiler does that every time.

And it gets more complex where there isn't just one way to do things but several, and the compiler actually chooses whichever of many methods fits best in different contexts, but all of that logic is also written by some engineer, one time.

But now that IS what happens, the compiler does it.

A software engineer no longer writes in assembly, they write in c or go or whatever.

I say I want a function that accepts a couple of arguments and returns the result of a math formula, and it just happens. I have no idea how the machine actually implements it; I just wrote a line of algebra in a particular formal style. It could have come right out of a pure math textbook, and the valid C function definition syntax could just as well be pseudocode describing a pure math idea.
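For instance, a sketch of that idea (in Python rather than C, and the quadratic-root formula is an arbitrary choice, not anything from the thread):

```python
import math

def quadratic_root(a, b, c):
    # one root of a*x**2 + b*x + c == 0, straight from the textbook formula
    return (-b + math.sqrt(b**2 - 4 * a * c)) / (2 * a)

# quadratic_root(1, -3, 2) -> 2.0
```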

If you tell an ai, or a human programmer for that matter, what you want in a rigorous enough format that all questions are answered, such that it doesn't matter what language the programmer uses or how the programmer implements it, then you, my friend, have written the program, and are the programmer. The ai, or the human who translated that into some other language, was indeed just the compiler.

It doesn't matter that there are multiple ways to implement the idea.

It's true that one programmer writes a very inefficient loop that walks an entire array once for every element in the array, while another comes up with some more sophisticated index or vector or math trick approach, but that's not the definition of anything.
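For example, a hypothetical sketch of the two approaches (counting elements that have a duplicate somewhere else in the list):

```python
from collections import Counter

def count_duplicated_naive(items):
    # walks the entire list once for every element: O(n^2)
    return sum(
        1 for i, x in enumerate(items)
        if any(x == y for j, y in enumerate(items) if i != j)
    )

def count_duplicated_fast(items):
    # one pass with a counter: O(n)
    counts = Counter(items)
    return sum(c for c in counts.values() if c > 1)

# both return 2 for [1, 1, 2]
```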

There are both simple and sophisticated compilers. You can already, right now, feed the same C code into different compilers and get results that all work, but one is 100x faster than another, one uses 100x less RAM than another, etc.

If you give a high level imprecise directive to an ai, you are not programming. If you give a high level precise directive to an ai, you are programming.

The language doesn't matter. What matters is what you express.

What makes you think they'll need a perfect spec?

Why do you think they would need a more defined spec than a human?

A human has the ability to contact the PM and say, "This won't work, for $reason," or, "This is going to look really bad in $edgeCase, here are a couple options I've thought of."

There's nothing about AI that makes such operations intrinsically impossible, but they require much more than just the ability to generate working code.

A human needs a perfect spec too.

Anything you don't define is literally undefined behavior, the same as in a compiler. The human will do something, and maybe you like it and maybe you don't.

A perfect spec is just another way to describe a formal language, i.e. any programming language.

If you don't care what you get, then say little and say it ambiguously and pull the slot machine lever.

If you care what you get then you don't necessarily have to say a lot, but you have to remove ambiguity, and then what you have is a spec, and if it's rigorous enough, it's a program, regardless of what language and syntax is used to express it.

I think the difference is that with a human you can say something ambiguous like "handle error cases" and they are going to put thought into the errors that come up. The LLM will just translate those tokens into if statements that do some validation and check return values after calls. The depth of thought is very different.
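For instance, the mechanical version might look something like this (a sketch; `fetch_user` and `client` are hypothetical names, not anything from the thread):

```python
def fetch_user(user_id, client):
    # mechanical input validation
    if user_id is None:
        return None
    resp = client.get(f"/users/{user_id}")
    # mechanical return-value check; swallows the failure with no retry,
    # logging, or thought about which errors actually matter here
    if resp.status_code != 200:
        return None
    return resp.json()
```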

But that is just a difference of degree, not of kind.

There is a difference between a human and an ai, and it is more than a difference of degree, but filling in gaps with something that fits is not very significant. That can be done perfectly mechanistically.

Reminds me of when computers were literally humans computing things (often women). How time weaves its circular web.

> then the person writing the spec is the actual software engineer

Sounds like this work would involve asking questions to collaborators, guess some missing answers, write specs and repeat. Not that far ahead of the current sota of AI...

Same reason the visual programming paradigm failed: the main problem is not the code.

While writing simple functions may be mechanistic, being a developer is not.

'guess some missing answers' is why Waterfall, or any big upfront design, has failed.

People aren't simply loading pig iron into rail cars like Taylor assumed.

The assumption of perfect central design with perfect knowledge and perfect execution simply doesn't work for systems which are far more like an organism than a machine.

Waterfall fails when domain knowledge is missing. Engineers won't take "obvious" problems into consideration when they don't even know what the right questions to ask are. When a system gets rebuilt for the 3rd time the engineers do know what to build and those basic mistakes don't get made.

Next gen LLMs, with their encyclopedic knowledge about the world, won't have that problem. They'll get the design correct on their first attempt because they're already familiar with the common pitfalls.

Of course we shouldn't expect LLMs to be a magic bullet that can program anything. But if your frame of reference is "visual programming" where the goal is to turn poorly thought out requirements into a reasonably sensible state machine then we should expect LLMs to get very good at that compared to regular people.

LLMs are NLP, what you are talking about is NLU, which has been considered an AI-hard problem for a long time.

I keep looking for discoveries that show any movement there. But LLMs are still basically pattern matching and finding.

They can do impressive things, but they actually have no concept of what the 'right thing' even is; it is statistics, not philosophy.

I mean, that's already the case in many places: the senior engineer / team lead gathering requirements and making architecture decisions is removing enough ambiguity to hand it off to juniors churning out the code. This just gives us very cheap, very fast-typing, but uncreative and a little dull, junior developers.

Programming has a clear reward function when the problem being solved is well-specified, e.g., "we need a program that grabs data from these three endpoints, combines their data in this manner, and returns it in this JSON format."

The same is true for math. There is a clear reward function when the goal is well-specified, e.g., "we need a sequence of mathematical statements that prove this other important mathematical statement is true."

I’m not sure I would agree. By the time you’ve written a full spec for it, you may as well have just written it in a high-level programming language anyway. You can make assumptions that minimise the spec needed… but programming APIs can have defaults too, so that’s no advantage.

I’d suggest that the Python code for your example prompt with reasonable defaults is not actually that far from the prompt itself in terms of the time necessary to write it.
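As a rough sketch, something like the following seems plausible; the URLs and the dict-merge "combine" rule are placeholders, not part of the original prompt:

```python
import json
import urllib.request

# Placeholder endpoints and a placeholder "combine in this manner" rule.
ENDPOINTS = [
    "https://example.com/api/a",
    "https://example.com/api/b",
    "https://example.com/api/c",
]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def combined_json():
    merged = {}
    for url in ENDPOINTS:
        # shallow dict merge here; the real combination rule is the spec's job
        merged.update(fetch(url))
    return json.dumps(merged)
```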

However, add tricky details like how you want to handle connection pooling, differing retry strategies, short circuiting based on one of the results, business logic in the data combination step, and suddenly you’ve got a whole design doc in your prompt and you need a senior engineer with good written comms skills to get it to work.

> I’m not sure I would agree. By the time you’ve written a full spec for it, you may as well have just written it in a high-level programming language anyway.

Remember all those attempts to transform UML into code back in the day? This sounds sorta like that. I’m not a total genai naysayer but definitely in the “cautiously curious” camp.

Absolutely, we've tried lots of ways to formalise software specification and remove or minimise the amount of coding, and almost none of it has stuck other than creating high level languages and better code-level abstractions.

I think generative AI is already a "really good autocomplete" and will get better in that respect, I can even see it generating good starting points, but I don't think in its current form it will replace the act of programming.

Thanks. I view your comment as orthogonal to mine, because I didn't say anything about how easy or hard it would be for human beings to specify the problems that must be solved. Some problems may be easy to specify, others may be hard.

I feel we're looking at the need for a measure of the computational complexity of problem specifications -- something like Kolmogorov complexity, i.e., minimum number of bits required, but for specifying instead of solving problems.

Apologies, I guess I agree with your sentiment but disagree with the example you gave, as I don't think it's well specified. My more general point is that there isn't an effective specification, which means that in practice there isn't a clear reward function. If we can get a clear specification (which we probably can, roughly in proportion to the complexity of the problem, as long as we don't go very far up the complexity curve), then I would agree we can get a good reward function.

> the example you gave

Ah, got it. I was just trying to keep my comment short!

Yeah, an LLM applied to converting design docs to programs seems like, essentially, the invention of an extremely high level programming language. Specifying the behavior of the program in sufficient detail is… why we have programming languages.

There’s the task of writing syntax, which is mechanical overhead, and then there’s the task of telling the computer what to do. People should focus on the latter (too much code is a symptom of insufficient automation or abstraction). Thankfully lots of people have CS degrees, not “syntax studies” degrees, right?

How about you want to solve sudoku, say. You simply specify that you want the output to have unique numbers in each row, unique numbers in each column, and unique numbers in each 3x3 grid.
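A minimal sketch of that spec written as a checker (the solving itself is left to whatever search or solver you point at it):

```python
def is_valid_sudoku(grid):
    # grid: 9x9 list of lists containing the digits 1-9
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # each row, column, and 3x3 box must contain every digit exactly once
    return all(group == digits for group in rows + cols + boxes)
```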

I feel like this is a very different type of programming, even if in some cases it would wind up being the same thing.

> when the problem being solved is well-specified

Phew! Sounds like I'll be fine, thank god for product owners.

20 years, number of "well specified" requirements documents I've received: 0.

[deleted]

> programming has a clear reward function.

If you’re the most junior level, sure.

Anything above that, things get fuzzy, requirements change, biz goals shift.

I don’t really see this current wave of AI giving us anything much better than incremental improvement over copilot.

A small example of what I mean:

These systems are statistically based, so there are no guarantees of correctness. Because of that, I wouldn't even gain anything from having it write my tests, since tests are easily built wrong in subtle ways.

I’d need to verify the test by reviewing it, and, IMO, writing the test myself would take less time than coaxing a correct one out, reviewing it, re-coaxing, repeat.

This could make programming more declarative or constraint-based, but you'd still have to specify the properties you want. Ultimately, if you are defining some function in the mathematical sense, you need to say somehow what inputs go to what outputs. You need to communicate that to the computer, and a certain number of bits will be needed to do that. Of course, if you have a good statistical model of how probable it is that a human wants a given function f, then you can perform that communication to the machine in about log2(1/P(f)) bits, so the model isn't worthless.
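A tiny worked example of that, assuming a probability the model might plausibly assign (the value 1/1024 is made up for illustration):

```python
import math

# If the model assigns the function you want a probability of 1/1024,
# then about log2(1/P(f)) = 10 bits are enough to single it out.
p_f = 1 / 1024
bits_needed = math.log2(1 / p_f)   # 10.0
```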

Here I have assumed something about the set that f lives in. I am taking for granted that a probability measure can be defined. In theory, perhaps there are difficulties involving the various weird infinities that show up in computing, related to undecidability and incompleteness and such. But at a practical level, if we assume some concrete representation of the program then we can just require that it is smaller than some given bound, and ditto for the number of computational steps on a particular model of machine (even if fairly abstract, like some lambda calculus thing), so realistically we might be able to not worry about it.

Also, since our input and output sets are bounded (say, so many 64-bit doubles in, so many out), that also gives you a finite set of functions in principle; just think of the size of the (impossibly large) lookup table you'd need to represent it.

> Programming has a clear reward function when the problem being solved is well-specified

The reason we spend time programming is that the problems in question are not easily defined, let alone the solutions.

A couple of problems that are impossible to prove from the constructivist angle:

1) Addition of the natural numbers
2) Equality of two real numbers

When you restrict your tools to perceptron based feed forward networks with high parallelism and no real access to 'common knowledge', the solution set is very restricted.

Basically, what Gödel proved that destroyed Russell's plans for the Principia Mathematica applies here.

Programmers can decide what is sufficient, if not perfect, in models.

can you give an example of what "in this manner" might be?

Very good point. For some types of problems maybe the answer is yes. For example, porting: the reward function is testing that it behaves the same in the new language as in the old one. Tricky for apps with a GUI, but it doesn't seem impossible.

The interesting kind of programming is the kind where I'm figuring out what I'm building as part of the process.

Maybe AI will soon be superhuman in all the situations where we know exactly what we want (win the game), but not in the areas we don't. I find that kind of cool.

Even for porting there's a bit of ambiguity... Do you port line-for-line or do you adopt idioms of the target language? Do you port bug-for-bug as well as feature-for-feature? Do you leave yet-unused abstractions and opportunities for expansion that the original had coded in, if they're not yet used, and the target language code is much simpler without?

I've found when porting that the answers to these are sometimes not universal for a codebase, but rather you are best served considering case-by-case inside the code.

Although I suppose an AI agent could be created that holds a conversation with you and presents the options and acts accordingly.

Full circle but instead of determinism you introduce some randomness. Not good.

Also, the reasoning is something business is dissonant about. The majority of planning and execution teams stick to processes. I see far more potential in automating those than in any part of app production.

Business is going to have a hard time if they believe they alone can orchestrate some AI consoles.

[deleted]

“A precise enough specification is already code”, which means we'll not run out of developers in the short term. But the day to day job is going to be very different, maybe as different as what we're doing now compared to writing machine code on punchcards.

Doubtful. This is the same mess we've been in repeatedly with 'low code'/'no code' solutions.

Every decade it's 'we don't need programmers anymore'. Then it turns out specifying the problem needs programmers. Then it turns out the auto-coder can only reach a certain level of complexity. Then you've got real programmers modifying over-complicated code. Then everyone realizes they've wasted millions and it would have been quicker and cheaper to get the programmers to write the code in the first place.

The same will almost certainly happen with AI generated code for the next decade or two, just at a slightly higher level of program complexity.

> Every decade it's 'we don't need programmers anymore'. Then it turns out specifying the problem needs programmers.

I literally refuted this in my comment…

That being said, some kind of “no-code” is not necessarily a bad idea, as long as you treat it as just an abstraction for people who actually are programmers, like C versus assembly, or high level languages vs C.

In fact I worked for a train manufacturer that had a cool “no code” tool to program automated train control software with automated theorem proving built in, and it was much more efficient than there former Ada implementation especially when you factor the hiring difficulties in.

their*

There's levels to this.

Certainly "compiled" is one reward (although a blank file fits that...) Another is test cases, input and output. This doesn't work on a software-wide scale but function-wide it can work.

In the future I think we'll see more of this test-driven development, where developers formally define the requirements and expectations of a system and then an LLM (combined with other tools) generates the implementation. So instead of making the implementation, you just declaratively say what the implementation should do (and shouldn't).
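A hedged sketch of what that could look like, with pytest-style tests as the declarative spec and a hypothetical `slugify` function standing in for the generated implementation:

```python
import re

def slugify(text):
    # a candidate implementation; in the workflow described above this is the
    # part the LLM would generate, and the tests below are the declarative spec
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def test_lowercases():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Hello, World!") == "hello-world"

def test_collapses_whitespace():
    assert slugify("a   b") == "a-b"
```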

I think you could set up a good reward function for a programming assistance AI by checking that the resulting code is actually used. Flag or just 'git blame' the code produced by the AI with the prompts used to produce it, and when you push a release, it can check which outputs were retained in production code from which prompts. Hard to say whether code that needed edits was because the prompt was bad or because the code was bad, but at least you can get positive feedback when a good prompt resulted in good code.

GitHub Copilot's telemetry does collect data on whether generated code snippets end up staying in the code, so presumably models are tuned on this feedback. But you haven't solved any of the problems set out by Karpathy here—this is just bankshot RLHF.

That could be interesting but it does seem like a much fuzzier and slower feedback loop than the original idea.

It also seems less unique to code. You could also have a chat bot write an encyclopedia and see if the encyclopedias sold well. Chat bots could edit Wikipedia and see if their edits stuck as a reward function (seems ethically pretty questionable or at least in need of ethical analysis, but it is possible).

The maybe-easy to evaluate reward function is an interesting aspect of code (which isn’t to say it is the only interesting aspect, for sure!)

> Does programming have a clear reward function? A vague description from a business person isn't it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?

Well, to give an example: the complexity class NP is all about problems that have quick and simple verification, but finding solutions for many problems is still famously hard.

So there are at least some domains where this model would be a step forward.
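For instance, checking a claimed subset-sum certificate is quick and simple, while finding one is the famously hard part; a minimal sketch:

```python
def verify_subset_sum(numbers, target, certificate):
    # certificate: a list of indices into `numbers`; verification is polynomial,
    # but finding such a certificate is the hard search problem.
    return (
        all(0 <= i < len(numbers) for i in certificate)
        and len(set(certificate)) == len(certificate)
        and sum(numbers[i] for i in certificate) == target
    )

# verify_subset_sum([3, 34, 4, 12, 5, 2], 9, [2, 4]) -> True  (4 + 5 == 9)
```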

But in that case, finding the solution is hard and you generally don't try. Instead, you try to get fairly close, and it's more difficult to verify that you've done so.

No. Most instances of most NP-hard problems are easy to find solutions for. (It's actually really hard to construct a hard instance of, e.g., the knapsack problem. And SAT solvers also tend to be really fast in practice.)

And in any case, there are plenty of problems in NP that are not NP hard, too.

Yes, approximation is also an important aspect of many practical problems.

There's also lots of problems where you can easily specify one direction of processing, but it's hard to figure out how to undo that transformation. So you can get plenty of training data.

I have a very simple integer linear program, and solving it is basically waiting for the heat death of the universe.

No, running it as a linear program is still slow.

I'm talking about small n=50 taking tens of minutes for a trivial linear program. Obviously the actual linear program is much bigger and scales quadratically in size, but still. N=50 is nothing.

Yes, there are also instances of problems in NP that are hard to solve in practice.

But here again: solutions to your problem are easy to verify, so it might be interesting to let an AI have a go at solving it.

If we will struggle to create reward functions for AI, then how different is that from the struggles we already face when divvying up product goals into small tasks to fit our development cycles?

In other words, to what extent does Agile's ubiquity prove our competence in turning product goals into de facto reward functions?

There's no reward function in the sense that optimizing the reward function means the solution is ideal.

There are objective criteria like 'compiles correctly' and 'passes self-designed tests' and 'is interpreted as correct by another LLM instance' which go a lot further than criteria that could be defined for most kinds of verbal questions.

My reward in Rust is often when the code actually compiles...

If they get permission and don't mind waiting, they could check if people throw away the generated code or keep it as-is.

You can define one based on passed tests, code coverage, other objectives, or weighted combinations without too much loss of generality.

The reward function could be "pass all of these tests I just wrote".
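A crude sketch of such a reward signal (the candidate function and the test cases are whatever you supply; nothing here is a specific library API):

```python
def reward(candidate_fn, cases):
    # Fraction of (args, expected) pairs the candidate gets right; exceptions
    # count as failures. A simple "pass all of these tests" style signal.
    passed = 0
    for args, expected in cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(cases)

# reward(abs, [((-2,), 2), ((3,), 3)]) -> 1.0
```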

Lol. Literally.

If you have that many well-written tests, you can pass them to a constraint solver today and get your program. No LLM needed.

Or even run your tests instead of the program.

Probably the parent assumes that he does have the tests, billions of them.

One very strong LLM could generate billions of tests alongside the working code and then train another, smaller model, or feed it into the next iteration of training the same strong model. Strong LLMs do exist for that purpose, e.g. Nemotron-4 340B and Llama 3.1 405B.

It would be interesting if a dataset like that were created and then released as open source. Many LLMs, proprietary or not, could incorporate the dataset in their training, and hundreds of LLMs on the internet would suddenly become much better at coding, all of them at once.

You cannot

After much RL, the model will just learn to mock everything to get the test to pass.

+1

Much business logic is really just a state machine where all the states and all the transitions need to be handled. When a state or transition is under-specified an LLM can pass the ball back and just ask what should happen when A and B but not C. Or follow more vague guidance on what should happen in edge cases. A typical business person is perfectly capable of describing how invoicing should work and when refunds should be issued, but very few business people can write a few thousand lines of code that covers all the cases.
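A toy sketch of that framing, with a hypothetical invoice lifecycle; anything missing from the transition table is exactly the under-specified case to bounce back as a question:

```python
from enum import Enum, auto

class Invoice(Enum):
    DRAFT = auto()
    SENT = auto()
    PAID = auto()
    REFUNDED = auto()

# Every (state, event) pair listed here is "handled"; anything else is
# the "what should happen when A and B but not C" question for the business person.
TRANSITIONS = {
    (Invoice.DRAFT, "send"): Invoice.SENT,
    (Invoice.SENT, "pay"): Invoice.PAID,
    (Invoice.PAID, "refund"): Invoice.REFUNDED,
}

def apply_event(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # e.g. "refund" on a DRAFT invoice: under-specified, so surface it
        raise ValueError(f"unhandled transition: {state.name} + {event!r}")
```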

> an LLM can pass the ball back and just ask what should happen when A and B but not C

What should the colleagues of the business person review before deciding that the system is fit for purpose? Or what should they review when the system fails? Should they go back over the transcript of the conversation with the LLM?

As an LLM can output source code, that's all answerable with "exactly what they already do when talking to developers".

There are two reasons the system might fail:

1) The business person made a mistake in their conversation/specification.

In this case the LLM will have generated code and tests that match the mistake. So all the tests will pass. The best way to catch this before it gets to production is to have someone else review the specification. But the problem is that the specification is a long trial-and-error conversation in which later parts may contradict earlier parts. Good luck reviewing that.

2) The LLM made a mistake.

The LLM may have made the mistake because of a hallucination which it cannot correct because in trying to correct it the same hallucination invalidates the correction. At this point someone has to debug the system. But we got rid of all the programmers.

This still resolves as "business person asks for code, business person gets code, business person says if code is good or not, business person deploys code".

That an LLM or a human is where the code comes from, doesn't make much difference.

Though it does kinda sound like you're assuming all LLMs must develop with Waterfall? That they can't e.g. use Agile? (Or am I reading too much into that?)

> business person says if code is good or not

How do they do this? They can't trust the tests because the tests were also developed by the LLM which is working from incorrect information it received in a chat with the business person.

The same way they already do with humans coders whose unit tests were developed by exactly same flawed processes:

Mediocrely.

Sometimes the current process works, other times the planes fall out of the sky, or updates cause millions of computers to blue screen on startup at the same time.

LLMs in particular, and AI in general, doesn't need to beat humans at the same tasks.

How does a business person today decide if a system is fit for purpose when they can't read code? How is this different?

They don't, the software engineer does that. It is different since LLMs can't test the system like a human can.

Once the system can test and update the spec to fix errors in it, build the program, and ensure the result is satisfactory, we have AGI. If you argue an AGI could do it, then yeah, it could, as it can replace humans at everything; the argument was about an AI that isn't yet AGI.

The world runs on fuzzy underspecified processes. On excel sheets and post-it notes. Much of the world's software needs are not sophisticated and don't require extensive testing. It's OK if a human employee is in the loop and has to intervene sometimes when an AI-built system malfunctions. Businesses of all sizes have procedures where problems get escalated to more senior people with more decision-making power. The world is already resilient against mistakes made by tired/inattentive/unintelligent people, and mistakes made by dumb AI systems will blend right in.

> The world runs on fuzzy underspecified processes. On excel sheets and post-it notes.

Excel sheets are not fuzzy and underspecified.

> It's OK if a human employee is in the loop and has to intervene sometimes

I've never worked on software where this was OK. In many cases it would have been disastrous. Most of the time a human employee could not fix the problem without understanding the software.

All software that interops with people, other businesses, APIs, deals with the physical world in any way, or handles money has cases that require human intervention. It's 99.9% of software if not more. Security updates. Hardware failures. Unusual sensor inputs. A sudden influx of malformed data. There is no such thing as an entirely autonomous system.

But we're not anywhere close to maximally automated. Today (many? most?) office workers do manual data entry and processing work that requires very little thinking. Even automating just 30% of their daily work is a huge win.