We did a lot of internal testing but have not run an official benchmark.

We find that the less the agent knows, the more it hallucinates.