"very good" 99 percent of time and hallucinating 1 percent makes the "very good" part untrustworthy.

The "very good" I'm referring to is far better than 99%. Sadly, I can't offer solid stats off the top of my head, so you'll have to just take my word for it ;)

I'll take the opportunity to note that if you're running solid evals, you'll have data to back up the efficacy of your system. If you are seeing a hallucination rate of 1%, then you certainly should be working on your harness, toolset, context, prompting, etc.
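To make "data to back it up" concrete, here is a rough sketch of how you might turn eval results into a hallucination rate with an error bar. It assumes each eval case has already been graded as hallucinated or not (how you grade is up to your harness); the function name `hallucination_rate` and the boolean-list input are made up for illustration, and the interval is a standard Wilson score interval, not anything specific to a particular eval framework.

```python
import math


def hallucination_rate(flags, z=1.96):
    """Return (rate, low, high) for a list of booleans.

    flags: hypothetical per-case grades, True = the response hallucinated.
    z:     normal quantile for the interval (1.96 is roughly 95%).
    The bounds are a Wilson score interval for a binomial proportion.
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one graded eval case")
    k = sum(1 for f in flags if f)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return p, max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Made-up grades: 10 hallucinations out of 1,000 cases.
    graded = [True] * 10 + [False] * 990
    rate, low, high = hallucination_rate(graded)
    print(f"rate={rate:.4f}, interval=({low:.4f}, {high:.4f})")
```

One thing the interval makes obvious: even zero observed hallucinations in 1,000 cases leaves an upper bound well above zero, so claiming "far better than 99%" with any confidence takes a reasonably large eval set.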

Saying "1% hallucination rate..." is akin to saying "30,000 mi lifespan for [modern Japanese make engine]". If that's the number you're seeing, something is wrong with the setup.