Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

I think people are misunderstanding reward functions and LLMs.

LLMs don't actually have a reward system like some other ML models.

They are trained with one, and when you look at DPO you can say they contain an implicit one as well.
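To make the DPO point concrete: the DPO paper's implicit reward is the beta-scaled log-ratio of the policy's likelihood to the reference model's likelihood, and the training loss is a logistic loss on the difference of those rewards for chosen vs. rejected responses. A minimal sketch (function names and the sequence log-probabilities here are illustrative, not from any particular library):

```python
import math

def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # DPO's implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from total sequence log-probs under the policy and reference model.
    return beta * (logp_policy - logp_ref)

def dpo_loss(logp_chosen: float, logp_ref_chosen: float,
             logp_rejected: float, logp_ref_rejected: float,
             beta: float = 0.1) -> float:
    # Preference loss: -log sigmoid(r_chosen - r_rejected).
    margin = (implicit_reward(logp_chosen, logp_ref_chosen, beta)
              - implicit_reward(logp_rejected, logp_ref_rejected, beta))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy hasn't moved from the reference, both rewards are zero
# and the loss is log(2), i.e. the model is indifferent between the pair.
print(dpo_loss(-10.0, -10.0, -12.0, -12.0))
```

So there is no reward model running at inference time; the "reward" only exists as this log-ratio baked into the weights during training.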

Even within a single agent, a different starting prompt will have you tracing a very different path through the model.

Maybe it still lands you in the same valley, but there are so many parameters and dimensions that I don't think two paths converging is very likely unless the answer is also correct.