> LLM can be trained to produce "I don't know" when confidence in other answers is weak
I'm unaware of any convincing studies (and would love to find some) showing that LLMs have any kind of internal confidence metric. The closest I've seen is reflective chain-of-thought after the fact, plus attempts to use per-token selection scores, which are doomed to fail (see https://vlmsarebiased.github.io/).
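
For concreteness, here's a minimal sketch of the per-token-score approach I mean: averaging the model's own token log-probabilities over a candidate answer and treating the result as "confidence". It assumes Hugging Face transformers, PyTorch, and the public gpt2 checkpoint; the prompt and the `avg_token_logprob` helper are illustrative, not anything from the quoted claim.

```python
# Naive "per-token selection score" confidence: mean log-probability the
# model assigns to a candidate answer, given the prompt.
# Assumes: pip install torch transformers (and the public "gpt2" checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_token_logprob(prompt: str, answer: str) -> float:
    """Mean log-probability of `answer` tokens conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab)

    # Log-probabilities each position assigns to the *next* token.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the answer tokens (positions after the prompt).
    answer_start = prompt_ids.shape[1] - 1
    return token_logprobs[0, answer_start:].mean().item()

# A higher average log-prob gets read as "more confident", which is exactly
# the aggregation I'm arguing is not a trustworthy internal confidence metric.
print(avg_token_logprob("The capital of France is", " Paris"))
print(avg_token_logprob("The capital of France is", " Berlin"))
```

The sketch also assumes the prompt and answer tokenize the same way separately as concatenated (true for simple space-prefixed words under GPT-2's BPE), which is one more fragile detail in treating these scores as calibrated confidence.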