Hacker News

You could also try PageIndex OCR, the first long-context OCR model. Most current OCR tools process each page independently, so they lose the document’s overall structure and produce markdown with incorrect heading levels. Because PageIndex OCR works over a longer context than a single page, the markdown it generates has more accurate heading levels and better captures the document’s structure.
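
If you want a quick way to see the heading-level problem in any tool's output, here is a minimal sketch in plain Python. It does not use PageIndex or any OCR API; `heading_outline` and `level_jumps` are hypothetical helper names I made up for illustration. It only pulls the ATX headings (`#`, `##`, ...) out of a markdown string and flags places where the level jumps by more than one, which is one common symptom of page-by-page OCR.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_outline(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) for each ATX heading, ignoring fenced code blocks."""
    outline = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


def level_jumps(outline: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Return headings that sit more than one level deeper than the previous heading."""
    jumps = []
    previous_level = 0  # treat the document start as level 0
    for level, title in outline:
        if level > previous_level + 1:
            jumps.append((level, title))
        previous_level = level
    return jumps


# Made-up sample: a subsection emitted as H3 directly under the H1 title.
sample = "# Report\n\n### 1.1 Scope\n\n## 2 Methods\n"
outline = heading_outline(sample)
suspicious = level_jumps(outline)
```

This only catches skipped levels. It cannot tell you that a heading marked `##` should really have been `###`, so treat it as a rough sanity check on the output rather than a measure of structural accuracy.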