Having an RL agent that's really good at search over some space sounds very powerful in general; the "proofs-as-search" framing makes this an appealing target. Back when I did more fundamental RL research, we worked on an extension of SoRB [0] in which an additional meta-level objective was learning improved heuristics to explore the search space faster; it would be exciting to figure out what a good setup for doing this looks like in today's LLM-policy-gradient world!
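
To make the idea concrete, here's a minimal sketch (not the actual SoRB code, and not our extension) of what "graph search over replay-buffer states with a learned heuristic" could look like: edge costs come from a goal-conditioned distance estimate as in SoRB, and a separately learned `heuristic_fn` plays the meta-level role of ordering expansion so the search finds waypoints faster. Both `distance_fn` and `heuristic_fn` are placeholders for learned models.

```python
# Sketch only: A*-style search over replay-buffer states.
# distance_fn(s, s') ~ learned estimate of steps from s to s' (SoRB-style).
# heuristic_fn(s, goal) ~ learned heuristic used to prioritize expansion.
import heapq
import numpy as np

def search_with_learned_heuristic(buffer_states, start, goal,
                                  distance_fn, heuristic_fn,
                                  max_edge_cost=5.0):
    """Return a sequence of waypoint states from `start` to `goal`."""
    nodes = np.vstack([start[None], buffer_states, goal[None]])
    n = len(nodes)
    start_idx, goal_idx = 0, n - 1

    # Pairwise edge costs; prune edges the distance model thinks are too long.
    cost = np.array([[distance_fn(nodes[i], nodes[j]) for j in range(n)]
                     for i in range(n)])
    cost[cost > max_edge_cost] = np.inf

    # A* over the state graph, guided by the learned heuristic.
    frontier = [(heuristic_fn(nodes[start_idx], nodes[goal_idx]), start_idx)]
    g = {start_idx: 0.0}
    parent = {start_idx: None}
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == goal_idx:
            break
        for v in range(n):
            if v == u or not np.isfinite(cost[u, v]):
                continue
            new_g = g[u] + cost[u, v]
            if new_g < g.get(v, np.inf):
                g[v] = new_g
                parent[v] = u
                heapq.heappush(
                    frontier,
                    (new_g + heuristic_fn(nodes[v], nodes[goal_idx]), v))

    # Recover the waypoint sequence by walking parents back from the goal.
    path, node = [], goal_idx
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return [nodes[i] for i in reversed(path)]
```

In this framing, the "meta-level" learning signal would be something like search cost (nodes expanded or wall-clock time) fed back into `heuristic_fn`, which is exactly the kind of objective that seems fun to port to an LLM-policy-gradient setting.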