That is such a cop-out. If there were a really good benchmark for measuring hallucinations, it would be included in every eval comparison graph.

The real reason is that every bench I've seen shows Anthropic with lower hallucination rates.