RL is about more than facts. Synthetic feedback is an obvious approach: does the model suggest code that compiles and performs well?
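As a rough illustration of what such a synthetic signal could look like, here is a minimal sketch in Python. The function name `synthetic_reward`, the `time_budget_s` parameter, and the 0.0 / 0.5 / 1.0 scoring scheme are all assumptions made for this example, not an established method: the snippet gets no reward if it fails to compile, partial reward if it compiles but crashes, and a higher reward the faster it runs.

```python
import time


def synthetic_reward(source: str, time_budget_s: float = 1.0) -> float:
    """Score a model-suggested Python snippet (hypothetical scoring scheme).

    0.0        -> does not compile
    0.5        -> compiles but raises when run
    0.5 to 1.0 -> runs; faster runs score higher, relative to time_budget_s
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return 0.0

    start = time.perf_counter()
    try:
        # NOTE: exec on untrusted code is unsafe; a real setup would sandbox this.
        exec(code, {"__name__": "__candidate__"})
    except Exception:
        return 0.5
    elapsed = time.perf_counter() - start

    # Linear speed bonus: 0.0 at or beyond the budget, approaching 1.0 as elapsed -> 0.
    speed_bonus = max(0.0, 1.0 - elapsed / time_budget_s)
    return 0.5 + 0.5 * speed_bonus
```

Calling `synthetic_reward("def f(:")` gives 0.0 and `synthetic_reward("1/0")` gives 0.5, while a snippet that runs cleanly scores somewhere above 0.5 depending on how long it takes. A signal like this needs no human labels, which is what makes it attractive, though wall-clock timing is noisy and a practical setup would likely average over several runs.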