Yeah, thanks for bringing up OCR! We also found that for complex PDFs, you first need to use OCR to convert them into Markdown and then run PageIndex. However, most OCR tools process each page independently, so they lose the overall document structure. For example, existing OCR tools often assign incorrect heading levels, which is a big problem if you want to build a tree structure from the output. You could check out PageIndex-OCR, the first long-context OCR model that can produce Markdown with more accurate heading-level recognition.
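
To make the heading-level issue concrete, here is a minimal sketch of building a tree from Markdown headings. It is not PageIndex's actual code; `build_tree` and `outline` are hypothetical helpers written for illustration, and the two sample documents are made up. It only shows that the nesting is driven entirely by the number of `#` characters, so wrong levels give a wrong tree.

```python
import re

# ATX-style Markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def build_tree(markdown):
    """Nest each heading under the nearest earlier heading with a smaller level."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue  # body text is ignored in this sketch
        node = {
            "title": match.group(2).strip(),
            "level": len(match.group(1)),
            "children": [],
        }
        # Pop until the top of the stack is a valid parent (strictly smaller level).
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def outline(node, depth=0):
    """Return the tree as a list of indented titles, two spaces per depth."""
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["title"])
        lines.extend(outline(child, depth + 1))
    return lines


# Made-up example: the same document with correct and with flattened levels.
correct = "# Report\n## Methods\n### Data\n## Results"
flattened = "# Report\n# Methods\n# Data\n# Results"

print(outline(build_tree(correct)))
print(outline(build_tree(flattened)))
```

With the correct levels, `Methods` and `Results` nest under `Report` and `Data` nests under `Methods`. With every heading emitted at level 1, all four end up as siblings and the hierarchy is gone.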

I am always on the lookout for new document extraction tools, but I can't seem to find any benchmark results for PageIndex-OCR. There are several benchmarks it could be evaluated on, such as OmniDocBench and READoc. So... got benchmarks?