Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost-prohibitive.
This is why rubrics are used as rewards.
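
As a rough illustration of what scoring with a rubric can look like, here is a minimal sketch in Python. It assumes a rubric is a list of weighted criteria and that the reward is the weighted fraction of criteria a response satisfies. The names `Criterion`, `rubric_reward`, and `EXAMPLE_RUBRIC` are hypothetical, and the keyword checks are only stand-ins for whatever human or model grader would judge each criterion in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric line: what to look for, how much it counts, how to check it."""
    description: str
    weight: float
    check: Callable[[str], bool]  # assumption: stand-in for a human or model judge


def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the response satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Hypothetical rubric, made up purely for illustration.
EXAMPLE_RUBRIC = [
    Criterion("recommends consulting a doctor", 2.0, lambda r: "doctor" in r.lower()),
    Criterion("mentions dosage", 1.0, lambda r: "dosage" in r.lower()),
    Criterion("stays within 50 words", 1.0, lambda r: len(r.split()) <= 50),
]

if __name__ == "__main__":
    print(rubric_reward("Ask your doctor about dosage.", EXAMPLE_RUBRIC))
    print(rubric_reward("See a doctor.", EXAMPLE_RUBRIC))
```

The point of the sketch is that the reward comes from checking a response against written criteria, so it does not depend on a large set of verified examples and counter-examples.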