well, as yann lecun said :

"Adversarial training, RLHF, and input-space contrastive methods have limited performance. Why? Because input spaces are BIG. There are just too many ways to be wrong" [1]

A way to solve the problem is projecting onto latent space and then try and discriminate/predict the best action down there. There's much less feature correlation down in latent space than in your observation space. [2]

[1]:https://x.com/ylecun/status/1803696298068971992 [2]: https://openreview.net/pdf?id=BZ5a1r-kVsf