I have gone through this process and evaluated the results. Maybe you're referring to their comment as written, but going through what OC described + handholding leads to very good results in my experience.

I agree with you, agentdev! If you want accurate results, you need a harness in place to control the quality of the output.

"Very good" 99 percent of the time and hallucinating the other 1 percent makes the "very good" part untrustworthy.

The "very good" I'm referring to is far better than 99%. I can't offer solid stats off the top of my head, sadly, so you'll just have to take my word for it ;)

I'll take the opportunity to note that if you're running solid evals, you'll have data to back the efficacy of your system. If you are seeing a hallucination rate of 1%, then you certainly should be working on your harness/toolset/context/prompting etc.

Saying "1% hallucination rate..." is akin to saying "30,000 mi lifespan for [modern Japanese make engine]". If that's the number you're getting, something is wrong.
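
To make the "you'll have data" point concrete, here is a minimal sketch of how a hallucination rate could be reported from eval results, with a 95% Wilson score interval so the sample size is visible next to the number. The function names and the label format (one boolean per graded eval case, `True` meaning the output was judged a hallucination) are my own assumptions for illustration, not anyone's actual harness.

```python
import math


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def hallucination_report(labels: list[bool]) -> dict:
    """Summarise graded eval cases; labels[i] is True if case i hallucinated.

    The label format is a hypothetical convention for this sketch.
    """
    total = len(labels)
    failures = sum(labels)
    low, high = wilson_interval(failures, total)
    return {
        "n": total,
        "hallucinations": failures,
        "rate": failures / total,
        "ci95": (low, high),
    }


if __name__ == "__main__":
    # Hypothetical data: 100 graded cases, none flagged as hallucinations.
    print(hallucination_report([False] * 100))
```

With the hypothetical 100 clean cases above, the upper end of the interval still sits above 3%, so a run that small can't tell "far better than 99%" apart from "about 99%". That's the practical reason to keep a large, graded eval set around if you want to make this kind of claim.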