Reward hacking is a well known and tracked problem at frontier labs - Claude 4’s system card reports on it for instance. It’s not surprising that a framework built on current llms would have reward hacking tendencies.

For this part of the stack the interesting question to me is how to identify and mitigate.