There's no reward function in the sense that maximizing it guarantees an ideal solution.

But there are objective criteria, like 'compiles correctly', 'passes self-designed tests', and 'is judged correct by another LLM instance', which go a lot further than any criteria that could be defined for most kinds of verbal questions.
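As a minimal sketch of what such a proxy reward could look like, here is a hypothetical scoring function (the name `proxy_reward`, the 0.5/0.5 weighting, and the use of Python's `compile`/`exec` as stand-ins for "compiles" and "passes tests" are all illustrative assumptions, and the LLM-judge criterion is omitted):

```python
def proxy_reward(source: str, test_source: str) -> float:
    """Illustrative sketch: score a code sample on objective proxy criteria.

    Awards 0.5 for being syntactically valid ("compiles correctly")
    and another 0.5 for passing the supplied tests. This is a toy
    stand-in, not any particular RL setup's actual reward.
    """
    score = 0.0
    # Criterion 1: "compiles correctly" -- here, parses as valid Python.
    try:
        compile(source, "<candidate>", "exec")
        score += 0.5
    except SyntaxError:
        return score
    # Criterion 2: "passes self-designed tests" -- run the candidate,
    # then run the tests in the same namespace; any failure forfeits
    # the remaining reward.
    namespace = {}
    try:
        exec(source, namespace)
        exec(test_source, namespace)
        score += 0.5
    except Exception:
        pass
    return score
```

A correct candidate scores 1.0, one that compiles but fails its tests scores 0.5, and one that doesn't parse scores 0.0; the point is that these signals are cheap and objective, not that a high score proves the solution is ideal.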