These days you can almost tell the "era" a solution was built in, since things are changing so fast.

In mid-2026 we have very large context windows and much smarter models than we did in 2024, when this was built. If I were to tackle this today, I'd ask a current frontier model to work through the source data and design a hierarchy that lets it sift through the content itself, drilling down as it sees fit, and I expect it would nail that.

It would not, and you would know that if you actually evaluated the results.

I have gone through this process and evaluated the results. Maybe you're referring to their comment as written, but going through what OC described + handholding leads to very good results in my experience.

I agree with you, agentdev! If you want accurate results, you need to have a harness in place to control the quality of the output.

Being "very good" 99 percent of the time and hallucinating the other 1 percent makes the "very good" part untrustworthy.

The "very good" I'm referring to is far better than only 99%. I can't offer solid stats off the top of my head, sadly, so you'll just have to take my word for it ;)

I'll take the opportunity to note that if you're running solid evals, you'll have data to back the efficacy of your system. If you are seeing a hallucination rate of 1%, then you certainly should be working on your harness/toolset/context/prompting etc.
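
For what it's worth, here is a rough sketch of the kind of number I mean. It assumes you already have eval cases graded as hallucinated or not (the grading is the hard part and isn't shown). The function names `hallucination_rate` and `wilson_interval` are made up for illustration, and the sample data is invented, not a real result.

```python
import math


def wilson_interval(failures: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)


def hallucination_rate(graded: list[bool]) -> float:
    """Fraction of graded eval cases flagged as hallucinated (True = hallucinated)."""
    if not graded:
        raise ValueError("no graded cases")
    return sum(graded) / len(graded)


# Invented example data: 1000 graded cases, 10 flagged as hallucinations.
graded = [True] * 10 + [False] * 990
rate = hallucination_rate(graded)
low, high = wilson_interval(sum(graded), len(graded))
print(f"rate={rate:.3%}, 95% CI=({low:.3%}, {high:.3%})")
```

The interval is the useful part: with only a handful of graded cases it is wide, so a small eval set can't tell a 1% rate apart from a much lower one.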

Saying "1% hallucination rate..." is akin to saying "30,000 mi lifespan for [modern Japanese make engine]". Something is wrong.

I can almost tell, then, that you have not done anything like this at production scale. Context window size is irrelevant.