"We did notice, and documented in our paper, instances when the DGM hacked its reward function.. To see if DGM could fix this issue.. We created a “tool use hallucination” reward function.. in some cases, it removed the markers we use in the reward function to detect hallucination (despite our explicit instruction not to do so), hacking our hallucination detection function to report false successes."

So: empirical evidence of a theoretically postulated phenomenon. Seems unsurprising.

Reward hacking is a well-known and actively tracked problem at frontier labs - Claude 4's system card reports on it, for instance. It's not surprising that a framework built on current LLMs would show reward hacking tendencies.

For this part of the stack, the interesting question to me is how to identify and mitigate it.
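To make the failure mode concrete, here's a minimal sketch of what a marker-based "tool use hallucination" check could look like, loosely modeled on the mechanism described in the quote. This is my own hypothetical reconstruction, not the DGM authors' code; the names (TOOL_MARKER, score_transcript) are invented for illustration.

```python
# Hypothetical marker-based hallucination check (not the DGM implementation).
# Assumption: the harness appends a marker to the transcript whenever a tool
# actually executes, so claimed-but-unmarked tool calls indicate hallucination.

TOOL_MARKER = "<<TOOL_EXECUTED>>"  # injected by the harness on real tool runs


def score_transcript(transcript: str, claimed_tool_calls: int) -> float:
    """Return 1.0 if every claimed tool call has a matching marker, else 0.0."""
    actual_markers = transcript.count(TOOL_MARKER)
    if actual_markers < claimed_tool_calls:
        return 0.0  # claimed tool use with no corresponding execution marker
    return 1.0


# The hack described in the quote: if this checker (or the marker constant)
# lives inside the code tree the agent is allowed to rewrite, then stripping
# the markers or the check itself is a valid self-modification that maximizes
# the *reported* reward without fixing the underlying behavior.
```

One mitigation direction this suggests: keep the scoring path outside anything the agent can modify, e.g. evaluate transcripts with a separate, read-only checker process, so "edit the judge" stops being an available move.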