It’s not that simple. I once trained an LLM on exactly this, to scratch the itch of this question.

The task was simple. Using the MS-MARCO[0] dataset, which contains queries, search results, and answers, I made a training set that has:

1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer

2. Questions paired only with irrelevant results, with the answer “No answer present”
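
To make the two sample types above concrete, here is a minimal sketch of how such pairs could be assembled. It is not the code I actually used: the record fields (question, supporting, answer), the distractor pool, and the build_samples helper are all hypothetical names for illustration, and nothing here reflects the real MS-MARCO schema.

```python
import random

NO_ANSWER = "No answer present"

def build_samples(record, distractor_pool, rng, n_distractors=2):
    """Return (answerable, unanswerable) samples for one record.

    `record` is a hypothetical dict with keys "question", "supporting"
    (list of relevant results) and "answer". `distractor_pool` is a list
    of results assumed to be irrelevant to this question.
    """
    # Type 1: real supporting results mixed with a few irrelevant ones.
    mixed = record["supporting"] + rng.sample(distractor_pool, n_distractors)
    rng.shuffle(mixed)
    answerable = {
        "question": record["question"],
        "results": mixed,
        "answer": record["answer"],
    }
    # Type 2: same question, irrelevant results only, fixed abstention answer.
    unanswerable = {
        "question": record["question"],
        "results": rng.sample(distractor_pool, n_distractors),
        "answer": NO_ANSWER,
    }
    return answerable, unanswerable
```

Passing an explicit random.Random instance as rng keeps the mixing reproducible between runs.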

The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (a good answer contrasted with a bad answer for the same user query) to GRPO (a verifier that checks the output against my annotations of whether an answer was present or not).
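
For the GRPO part, a verifier of that kind could look roughly like the sketch below. This is an assumed shape rather than my exact reward: the function name abstention_reward and the exact string matching are illustrative choices.

```python
NO_ANSWER = "No answer present"

def abstention_reward(completion: str, answer_present: bool) -> float:
    """Score 1.0 when the model's abstain/answer decision matches the annotation.

    `answer_present` is the dataset annotation; `completion` is the model output.
    """
    # Treat the output as an abstention if it is the fixed phrase,
    # ignoring case, surrounding whitespace and a trailing period.
    abstained = completion.strip().lower().rstrip(".") == NO_ANSWER.lower()
    # Correct decisions: answer when one is present, abstain when none is.
    return 1.0 if abstained != answer_present else 0.0
```

Note that a reward of this shape scores only the abstain-or-answer decision, not whether the answer itself is right.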

Lo and behold, this didn’t reduce hallucination; rather, it made it much worse. Now the model started claiming “No answer present” even when one was, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).

Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (take the list of search results, check the relevance of each result to the user query, if results fall below a certain threshold of relevance then don’t use them to answer …).
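
Written out as an explicit procedure, the first-principles steps in that parenthesis would be something like the sketch below. The relevance function is a toy word-overlap stand-in and the 0.5 threshold is arbitrary; both are assumptions for illustration, and a real system would use a proper relevance model.

```python
def relevance(query: str, result: str) -> float:
    """Toy relevance score: fraction of query words that appear in the result."""
    q = set(query.lower().split())
    r = set(result.lower().split())
    return len(q & r) / len(q) if q else 0.0

def usable_results(query, results, threshold=0.5):
    """Keep only results whose relevance meets the threshold."""
    return [r for r in results if relevance(query, r) >= threshold]

def answer_or_abstain(query, results, threshold=0.5):
    """Return the results to answer from, or abstain if none are relevant enough."""
    kept = usable_results(query, results, threshold)
    return kept if kept else "No answer present"
```

The point of the contrast is that here the abstention rule is a single explicit branch, whereas the trained model has to pick it up implicitly from examples.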

tl;dr: not as simple as one might think, perhaps not attainable at all.

0: https://huggingface.co/datasets/microsoft/ms_marco

Thank you for sharing! Based on your experience, do you think a two-model system might fare better? For example, two models in serial where the second model is trained to "sniff out" potential hallucinations and fact check them (and possibly iterate with the first model)?