Perhaps not entirely open domain, but I have high hopes for “real RL” in coding, where you can get a reward signal from compile/runtime errors and tests.
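Roughly what I have in mind for the reward signal, as a toy sketch (assumes a single-file candidate, pytest as the test harness, and no sandboxing, all of which a real training setup would handle much more carefully):

    import os
    import subprocess
    import tempfile

    def code_reward(candidate_code: str, test_code: str) -> float:
        """Toy reward: 0.0 on a compile error, 0.2 if it compiles but tests fail,
        1.0 if the whole test suite passes."""
        with tempfile.TemporaryDirectory() as tmp:
            with open(os.path.join(tmp, "solution.py"), "w") as f:
                f.write(candidate_code)
            with open(os.path.join(tmp, "test_solution.py"), "w") as f:
                f.write(test_code)
            # Compile/syntax check on the model's candidate.
            compiled = subprocess.run(
                ["python", "-m", "py_compile", "solution.py"],
                cwd=tmp, capture_output=True, timeout=10,
            )
            if compiled.returncode != 0:
                return 0.0
            # Run the tests; runtime errors and assertion failures both surface here.
            tests = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=60,
            )
            return 1.0 if tests.returncode == 0 else 0.2

The point is just that the environment gives you a cheap, automatic scalar to optimize against, unlike most open-domain tasks.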
Interesting, has anyone been doing this? I.e., training/fine-tuning an LLM against an actual coding environment, as opposed to tacking that on later as a separate "agentic" construct?
I suspect that the big vendors are already doing it, but I haven’t seen a paper on it.