I think one problem is that models which hallucinate often, succeeding only a few times out of 8 or 16, still get good results on benchmarks, most of which measure success out of the top k. From a benchmark's perspective, you don't really care whether 15 of your 16 generations failed, as long as one succeeded, but as a user you mostly care that the one generation you actually get is the successful one. I think this effect is easiest to see on Gemini Flash: it hallucinates like crazy, but it looks like that's by design to boost benchmarks.
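
To put rough numbers on that gap, here's a minimal sketch. It assumes the benchmark reports a pass@k-style score (at least one correct answer among k samples). The helper name `pass_at_k` and the "1 correct out of 16" figures are purely illustrative, not measurements of any real model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k generations, drawn without
    replacement from n generations of which c are correct, is correct.

    Illustrative helper: n, c, and k are hypothetical inputs.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the ways to pick k generations that are all wrong;
    # math.comb returns 0 when k > n - c, i.e. when a hit is guaranteed.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model: only 1 of its 16 generations is correct.
benchmark_view = pass_at_k(n=16, c=1, k=16)  # scored on "any of 16 succeeded"
user_view = pass_at_k(n=16, c=1, k=1)        # user sees a single generation

print(f"pass@16 = {benchmark_view:.4f}")
print(f"pass@1  = {user_view:.4f}")
```

Under these assumptions the same model scores perfectly when judged on any-of-16, while a user who only sees one generation gets a correct answer about 6% of the time.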