If you could write that reward function, you wouldn't need an LLM; you'd just query the reward function to answer any question. You can create a benchmark and check that automatically, but you can't solve this in the general case. The model can do well on the benchmark and still give overconfident answers in areas the benchmark doesn't cover.

You can definitely tune a model to say "I don't know" more often, but it will cost you performance: the model will refuse some questions that it could have answered meaningfully. In the degenerate case, the model could collapse into predicting that sequence always or almost always.

I guess so. Just to be clear, I was talking about post-training methods for reasoning models here, not pre-training. I think "model as a judge" should actually do okay as a "sentiment analysis" style reward for expressing uncertainty. So if none of the thousands of reasoning traces you generate reach the validated answer, you run a judge to rate uncertainty and put those reasoning traces back into the training pool.
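
To make the fallback above concrete, here is a rough sketch of what I mean. Everything in it is hypothetical: `reaches_answer` stands in for whatever verifier you already have, `judge_uncertainty` stands in for a call to a judge model, and the 0.5 threshold is arbitrary. The toy keyword judge exists only so the snippet runs.

```python
from typing import Callable, List, Tuple

Trace = str

def select_for_training(
    traces: List[Trace],
    reaches_answer: Callable[[Trace], bool],     # hypothetical verifier
    judge_uncertainty: Callable[[Trace], float], # hypothetical judge, score in [0, 1]
    threshold: float = 0.5,                      # arbitrary cutoff, an assumption
) -> List[Tuple[Trace, float]]:
    """Return (trace, reward) pairs to put back into the training pool."""
    verified = [t for t in traces if reaches_answer(t)]
    if verified:
        # At least one trace hit the validated answer: train on those only.
        return [(t, 1.0) for t in verified]
    # No trace succeeded: fall back to the judge and keep the traces
    # that express uncertainty, using the judge score as the reward.
    scored = [(t, judge_uncertainty(t)) for t in traces]
    return [(t, s) for t, s in scored if s >= threshold]

def toy_judge(trace: Trace) -> float:
    """Keyword stand-in for a real judge model; illustrative only."""
    hedges = ("not sure", "i don't know", "uncertain")
    return 1.0 if any(h in trace.lower() for h in hedges) else 0.0

if __name__ == "__main__":
    traces = ["So the answer is 7.", "I am not sure this can be determined."]
    print(select_for_training(traces, lambda t: "42" in t, toy_judge))
```

Note that the judge only gets a say when every sampled trace fails verification, which is exactly the case the next paragraph worries about.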

But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post-training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the only valid answer is "I don't know", like this benchmark does. Or maybe that doesn't work either, no matter how many you create.
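
For what it's worth, the verifiable reward for such a task is trivial to write, which is part of why the idea is tempting. A minimal sketch follows, assuming each task carries an `answerable` flag and, when answerable, a `gold` string. The phrase list and the substring match against `gold` are my own simplifications, not how any particular benchmark scores this.

```python
import re
from typing import Optional

# Hypothetical list of phrases that count as abstaining.
ABSTAIN_PATTERNS = (
    r"\bi don'?t know\b",
    r"\bcannot be determined\b",
)

def abstains(response: str) -> bool:
    """True if the response contains one of the abstention phrases."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(re.search(p, text) for p in ABSTAIN_PATTERNS)

def abstention_reward(response: str, answerable: bool,
                      gold: Optional[str] = None) -> float:
    """1.0 for the verifiably correct behavior on this task, else 0.0."""
    if not answerable:
        # Unanswerable task: the only valid answer is to abstain.
        return 1.0 if abstains(response) else 0.0
    # Answerable task: abstaining earns nothing, only the gold answer does.
    if abstains(response) or gold is None:
        return 0.0
    return 1.0 if gold.lower() in response.lower() else 0.0
```

Mixing answerable tasks into the same reward is the obvious guard against the collapse mentioned earlier, since always abstaining then scores zero on them. Whether that generalizes beyond the tasks you wrote is the open question.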

I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdős problem without some degree of self-deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but to being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.

If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.