Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.
I think people are misunderstanding reward functions and LLMs.
LLMs don't actually have a reward system at inference time the way some other ML models do.
They are trained with one, and in the case of DPO you can argue they contain an implicit reward model as well.
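For what it's worth, the DPO paper does derive an implicit reward from the trained policy. A sketch of that standard formulation (here \(\pi_\theta\) is the fine-tuned policy, \(\pi_{\mathrm{ref}}\) the reference model, and \(Z(x)\) a prompt-dependent partition function that cancels in the pairwise loss):

```latex
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

So "the reward function" of a DPO-trained model is recoverable only up to that \(Z(x)\) term; there is no reward signal being computed at generation time.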
Even within one agent, a different starting prompt will have you tracing a very different path through the model.
Maybe it still lands you in the same valley, but with that many parameters and dimensions, I don't think two paths converging is very likely unless the answer is also correct.
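The point above can be illustrated with a toy (entirely hypothetical) next-token "model": a fixed transition table stands in for the network, and greedy decoding from two different starting prompts traces completely different paths through the same parameters.

```python
# Toy stand-in for a next-token model: a fixed table of
# "next-token probabilities" conditioned on the previous token.
# All tokens and probabilities here are invented for illustration.
NEXT = {
    "write":  {"code": 0.9, "tests": 0.1},
    "review": {"code": 0.2, "tests": 0.8},
    "code":   {"done": 1.0},
    "tests":  {"done": 1.0},
}

def greedy_path(start, steps=2):
    """Greedy decoding: repeatedly pick the highest-probability next token."""
    path = [start]
    tok = start
    for _ in range(steps):
        tok = max(NEXT[tok], key=NEXT[tok].get)
        path.append(tok)
    return path

# Same "model", different prompts, different trajectories:
print(greedy_path("write"))   # ['write', 'code', 'done']
print(greedy_path("review"))  # ['review', 'tests', 'done']
```

The real effect is the same idea at scale: the prompt conditions every subsequent sampling step, so the "writer" and "reviewer" prompts explore different regions of the model's output distribution even without different reward functions.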
It’s superstition that using one brand of slop generator to “review” the slop from another brand somehow makes things better. It’s slop all the way down.
https://github.com/karpathy/llm-council
https://ui.adsabs.harvard.edu/abs/2025arXiv250214815C/abstra...
https://www.arxiv.org/abs/2509.23537
https://www.aristeidispanos.com/publication/panos2025multiag...
https://arxiv.org/abs/2305.14325
https://arxiv.org/abs/2306.05685
https://arxiv.org/abs/2310.19740v1