Outside of games and coding, generating enough valid examples and counter-examples to harness the power of RL is cost-prohibitive.

That is why rubrics are used as rewards.
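
To make the idea concrete, here is a minimal sketch of a rubric-based reward, not any particular lab's implementation. It assumes a rubric is a list of weighted criteria and that the reward is the weighted fraction of criteria a response satisfies. The names `Criterion`, `rubric_reward`, and `EXAMPLE_RUBRIC` are hypothetical, and the keyword checks stand in for what would more likely be an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    description: str
    weight: float
    # Assumption: a simple predicate stands in for an LLM judge here.
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


# Hypothetical rubric for a health-advice style answer.
EXAMPLE_RUBRIC = [
    Criterion("Recommends seeing a doctor", 2.0,
              lambda r: "doctor" in r.lower()),
    Criterion("Mentions hydration", 1.0,
              lambda r: "water" in r.lower() or "fluids" in r.lower()),
    Criterion("Avoids claiming a definitive diagnosis", 1.0,
              lambda r: "definitely" not in r.lower()),
]

if __name__ == "__main__":
    print(rubric_reward("Drink water.", EXAMPLE_RUBRIC))  # 0.5
```

The point of the sketch is that the reward comes from grading one response against written criteria, so no large set of verified examples and counter-examples is needed up front.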