Interesting work, but this strikes me as a somewhat quixotic fight against an inevitable tendency of statistical models. Reinforcement learning here has a single goal, an agreeable mean: tuning effectively stops once the LLM produces agreeable responses more often than not, and the only way to push that toward absolute certainty would be to tune for an infinite amount of time.

I also don't see why this method couldn't be subsumed by something simpler like dynamic temperature adjustment. Transformers are fully capable of generating unpredictable yet semantically coherent text by varying a single hyperparameter, so maybe it would make more sense to simply experiment with different temperature settings rather than leaving temperature at the usual fixed value.
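
For concreteness, here's a minimal sketch of what I mean by temperature sampling (the function name is mine, not from any particular framework): you divide the logits by a temperature T before the softmax, so higher T flattens the distribution and yields more unpredictable samples, while lower T sharpens it toward the mode.

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token index from logits rescaled by the given temperature."""
    scaled = logits / temperature
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Illustrative example: the same logits sampled at three temperatures.
# Low T concentrates mass on the top token; high T spreads it out.
logits = np.array([2.0, 1.0, 0.5, 0.1])
for t in (0.2, 1.0, 2.0):
    draws = [sample_with_temperature(logits, t) for _ in range(1000)]
    print(t, np.bincount(draws, minlength=len(logits)) / 1000)
```

A "dynamic" version would just make `temperature` a function of context (or of how repetitive the recent output looks) instead of a constant, which is the kind of knob I suspect gets you most of the way there.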