Do you reward the RL model based on token consumption when multiple LLMs complete the task?