This is no different from any other ML situation, though? Famously, people found out that Amazon's hands-free checkout system was offloading the cases where it couldn't give a high-confidence answer to human reviewers. I would be shocked if those judgments were not then labeled and fed back into automated training later.
And I should clarify: I said "codified," but I don't mean just code. Labeled training samples count too. That doesn't change the fact that finding a model that will give good answers can ultimately be conceptualized as a search.
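To make the "search" framing concrete, here's a toy sketch: random search over the weights of a tiny linear classifier, scored against labeled samples. Everything in it (the data, the threshold, the search budget) is invented for illustration, not drawn from any real system.

    import random

    # Hypothetical labeled samples: (features, label).
    labeled_samples = [((1.0, 0.0), 1), ((0.0, 1.0), 0),
                       ((0.9, 0.2), 1), ((0.1, 0.8), 0)]

    def accuracy(weights):
        # Score a candidate model by how many labels it reproduces.
        hits = 0
        for (x1, x2), label in labeled_samples:
            pred = 1 if weights[0] * x1 + weights[1] * x2 > 0.5 else 0
            hits += (pred == label)
        return hits / len(labeled_samples)

    # "Training" as literal search: propose candidates, keep the best.
    best, best_score = None, -1.0
    for _ in range(1000):
        candidate = (random.uniform(-1, 1), random.uniform(-1, 1))
        score = accuracy(candidate)
        if score > best_score:
            best, best_score = candidate, score

Gradient descent is just a much smarter way of walking the same space; the labels define what "good" means, exactly as hand-written rules would.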
You are also blurring the reinforcement/scoring that happens at inference time with the work that is done at training time? Using RL at training time isn't just about the expense being paid there. The goal is to find the policies that are best to use at inference time.
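A minimal sketch of that split, with an invented two-armed bandit as the environment (the payoffs, learning rate, and exploration rate are all made up): exploration and value updates happen only during training, and at inference you just query the frozen policy.

    import random

    REWARDS = {0: 0.2, 1: 0.8}  # hypothetical payoff probability per action

    def pull(action):
        # Stochastic reward from the made-up environment.
        return 1.0 if random.random() < REWARDS[action] else 0.0

    q = [0.0, 0.0]           # value estimate per action
    alpha, epsilon = 0.1, 0.2

    # Training time: explore (epsilon-greedy) and update the estimates.
    for _ in range(5000):
        a = random.randrange(2) if random.random() < epsilon else q.index(max(q))
        q[a] += alpha * (pull(a) - q[a])

    # Inference time: no exploration, no updates -- just act on
    # whatever policy training found.
    def policy():
        return q.index(max(q))

    print("chosen action:", policy())

The expensive, noisy part (trial, error, scoring) is all up front; what ships is the cheap argmax at the end.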