Hacker News

viraptor 3 days ago

I've done only one pipeline trying to parse actual PDF structure, and the least surprising part of it is that some documents have a top-to-bottom layout while others are bottom-to-top: the page is flipped, and the text is flipped again so it stays readable. It only gets worse from there. Absurd is correct.

Muromec 3 days ago

That means you have to put the text (each individual letter) into its correct place by rendering the PDF, but it doesn't justify actual OCR, which goes one step further and then back again: it renders the glyphs and then guesses them back from the pixels. But that's just the text; tables and structure are also in there somewhere, waiting to be recovered.
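
To make the "put each letter into its place" step concrete, here is a rough sketch of what I mean. It assumes you already have each glyph's character and origin from some extractor, plus a PDF-style affine matrix `[a b c d e f]`. The names `Glyph`, `apply_matrix`, `reading_order`, `FLIP_Y` and the `line_tolerance` value are all made up for illustration, and real documents also need fonts, rotation, and per-glyph text matrices that this ignores.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    char: str
    x: float
    y: float


def apply_matrix(m, x, y):
    # PDF-style affine matrix [a b c d e f]:
    #   x' = a*x + c*y + e,  y' = b*x + d*y + f
    a, b, c, d, e, f = m
    return a * x + c * y + e, b * x + d * y + f


def reading_order(glyphs, matrix, line_tolerance=2.0):
    """Place glyphs with the matrix, then group them into text lines.

    Assumes the target space has its origin at the bottom-left,
    so a larger y means higher on the page.
    """
    placed = [Glyph(g.char, *apply_matrix(matrix, g.x, g.y)) for g in glyphs]
    # Top of the page first, then left to right.
    placed.sort(key=lambda g: (-g.y, g.x))

    lines = []
    for g in placed:
        # Same line if the baseline is close to the line's first glyph.
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return ["".join(g.char for g in sorted(line, key=lambda g: g.x)) for line in lines]


# Hypothetical matrix that flips y on a page 100 units high.
FLIP_Y = [1, 0, 0, -1, 0, 100]

glyphs = [Glyph("o", 5, 30), Glyph("i", 5, 10), Glyph("H", 0, 10), Glyph("y", 0, 30)]
print(reading_order(glyphs, FLIP_Y))
```

Run the same glyphs through the identity matrix `[1, 0, 0, 1, 0, 0]` and the lines come out in the opposite order, which is the top-to-bottom versus bottom-to-top problem in miniature. No pixels or OCR involved, just coordinates.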