Isn't this a good reward function for RL? Take a codebase's test suite. Rip out a function, let the LLM rewrite the function, benchmark it and then RL it using the benchmark results.
Isn't this a good reward function for RL? Take a codebase's test suite. Rip out a function, let the LLM rewrite the function, benchmark it and then RL it using the benchmark results.