I'd love to see something that uses RL to train an agent (of sorts) that interacts with an interactive theorem prover (like Lean4, Coq, or Isabelle/HOL), probably via a harness rather than plain shell-like interaction, and that actively exploits the fact that discovery itself is not harmful beyond the inference and oracle cost of investigating an abandoned branch.
I.e., it's not at all like a typical game, because "success rate without relying on rollback/savestate-reloading" never actually matters. An agent that splits its compute evenly between abandoned (exploratory) branches and the path that ends up in the solution the formal verifier checks, while having a near-100% solve rate on the problems fed to it, is a VERY GOOD agent.
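To make the harness point concrete, here's a minimal sketch of the kind of interface I have in mind (all names are hypothetical, not any real Lean4/Coq/Isabelle API): proof states are immutable handles, so "abandoning" an exploratory branch just means dropping the handle, and nothing the agent tries in that branch can corrupt the state it backtracks to.

    # Hypothetical harness sketch -- not a real Lean4/Coq/Isabelle API.
    # Proof states are immutable handles, so abandoning a branch just means
    # dropping the handle; the only cost is the compute already spent on it.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProofState:
        goals: tuple            # open goals at this node
        history: tuple = ()     # tactic sequence that produced this node

    class ProverHarness:
        """Wraps an ITP session; apply() never mutates the parent state."""

        def apply(self, state: ProofState, tactic: str) -> ProofState:
            # A real harness would send `tactic` to the prover and parse the
            # resulting goals; this stub only fixes the interface shape.
            raise NotImplementedError("backend-specific")

        def solved(self, state: ProofState) -> bool:
            return not state.goals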
That's because this task, unlike most RL tasks, is one where the agent gets to use discovery to log an interaction trace that can then be trivially, mechanically trimmed down to a verifiable proof of the provided problem. I.e., the hard part is finding ANY path that solves it, without spending exponential amounts of compute to brute-force the problem over the bounded state space of practical relevance; that would take longer than the heat death of the universe, i.e., it's impractical even in theory.
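A toy sketch of what I mean by "trivially mechanically trimmed" (the (parent, tactic, child) log format is my own invention, not any real prover's output): once any node with zero open goals is reached, walking the parent pointers back to the root yields the linear tactic script the verifier replays; every other logged node was exploration.

    # Toy sketch: trimming a logged interaction trace down to the proof path.
    # The (parent_id, tactic, child_id) format is hypothetical; any tree/DAG
    # log with parent pointers works the same way.

    def trim_trace(steps, solved_node):
        """steps: list of (parent_id, tactic, child_id); returns the tactic
        sequence from the root to solved_node, dropping all exploration."""
        parent_of = {child: (parent, tactic) for parent, tactic, child in steps}
        script = []
        node = solved_node
        while node in parent_of:              # walk back up to the root
            node, tactic = parent_of[node]
            script.append(tactic)
        return list(reversed(script))

    # 5 logged interactions, only 2 of which end up in the final proof:
    trace = [
        (0, "induction n", 1),
        (1, "simp", 2),        # abandoned
        (1, "ring_nf", 3),     # abandoned
        (0, "omega", 4),       # abandoned
        (1, "simp_arith", 5),  # closes all goals
    ]
    print(trim_trace(trace, solved_node=5))   # ['induction n', 'simp_arith']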
Most RL tasks want an agent that is particularly good at its task; and while the effort spent to find a proof certainly matters (if only because lower cost means the agent can train on more instances within the same training budget), it's much less relevant than the solve rate itself: the fraction of problems for which any verifiably-correct proof sequence can be found at some definable level of effort, expressed as e.g. number of shots, total compute budget per instance, or the ratio of exploration nodes to nodes that end up in the final proof sequence.
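For concreteness, a small sketch of such effort-conditioned metrics (the record fields and threshold values are made up for illustration): solve rate at a fixed node budget, plus the exploration-to-proof node ratio over solved instances.

    # Sketch of effort-conditioned metrics; field names / thresholds are made up.
    from dataclasses import dataclass

    @dataclass
    class EpisodeResult:
        solved: bool
        nodes_expanded: int   # all prover interactions, incl. abandoned branches
        proof_len: int        # nodes that survive trimming into the final proof

    def solve_rate_at_budget(results, max_nodes):
        """Fraction of instances solved within max_nodes prover interactions."""
        ok = sum(r.solved and r.nodes_expanded <= max_nodes for r in results)
        return ok / len(results)

    def exploration_ratio(results):
        """Mean exploration-to-proof node ratio over solved instances."""
        solved = [r for r in results if r.solved and r.proof_len > 0]
        return sum((r.nodes_expanded - r.proof_len) / r.proof_len
                   for r in solved) / len(solved)

    runs = [EpisodeResult(True, 120, 9), EpisodeResult(True, 40, 7),
            EpisodeResult(False, 500, 0)]
    print(solve_rate_at_budget(runs, max_nodes=200))   # ~0.67
    print(exploration_ratio(runs))                     # ~8.5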
Non-benchmark usage would mostly entail semi-synthetic, crowd-sourced datasets: open sub-instances from practical applications of formal verification, plus more-synthetic instances derived from very coarse high-level questions (mechanically broken down into more-manageable chunks before the RL agent gets to work) like "given these more-specific rules about what is _actually_ UB and what is only UB in ANSI but actually defined in the specific toolchain that we use: does that C program over there contain ANY UB?" or "is there ANY way that input on that file/network socket over there to that program over here could execute arbitrary code?". Given that, there'd be no economic incentive to solve any given instance more than once, beyond what is necessary to keep the RL training process itself stable.
The task also lends itself to semi-online learning, as every supplied instance essentially pays once for a verified solution and the overall process should deliver solid ROI. Running a single GPU cluster/pile for both training and inference would allow higher utilization, at the cost of some variable latency between rolling out an episode and training on that episode's oracle-verified reward.
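A rough, deliberately toy sketch of that loop (all names and the simulated "learning" are placeholders, not a real framework): one shared pool alternates between a block of rollouts and a training step on whatever episodes the oracle has verified so far, so updates always lag the rollouts by a variable amount.

    # Toy sketch of semi-online learning on one shared cluster: alternate a
    # block of rollouts with a training step on whatever episodes the oracle
    # has verified so far.  Everything here is a simulated placeholder.
    import random

    def rollout(policy, instance):
        # Placeholder: run the agent against the prover harness and the
        # verifier oracle; here the outcome is just simulated.
        return {"instance": instance, "solved": random.random() < policy["p"]}

    def train_step(policy, episodes):
        # Placeholder for a policy-gradient update on verified episodes.
        solved = sum(e["solved"] for e in episodes)
        policy["p"] = min(0.99, policy["p"] + 0.01 * solved)

    policy, backlog = {"p": 0.2}, []
    for instance in range(100):
        backlog.append(rollout(policy, instance))   # inference phase
        if len(backlog) >= 16:                      # episodes in the backlog were
            train_step(policy, backlog)             # rolled out under an older,
            backlog.clear()                         # slightly stale policy
    print(policy["p"])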
Having an RL agent that's really good at search across some space sounds very powerful in general; "proofs as search" makes this an appealing target. Back in the day, when I did more fundamental RL research, we worked on an extension of SoRB [0] where an additional meta-level target was learning improved heuristics to explore the search space faster; it would be exciting to figure out what a good setup for doing this kind of thing looks like in today's LLM-policy-gradient world!
[0]: https://arxiv.org/abs/1906.05253