I've built document parsing pipelines for a few clients recently, and yeah, this approach yields far superior results with what's currently available. Which is completely absurd, but here we are.
I've only built one pipeline that tried to parse actual PDF structure, and the least surprising part of it is that some documents have top-to-bottom layout while others are bottom-to-top, flipped, with the text flipped again so it renders readably. It only gets worse from there. Absurd is correct.
That means you have to put the text (each individual letter) into its correct place by rendering the PDF, though it doesn't justify actual OCR, which goes one step further and back by rendering the page and then guessing the glyphs back out of the pixels. And that's just text; tables and structure are also in there somewhere, waiting to be recovered.
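To make the "put each letter in its place" step concrete, here's a minimal sketch. It assumes you've already pulled positioned glyphs out of the PDF (e.g. with a library like pdfminer.six) as `(x, y, char)` tuples in PDF user space, where y grows upward; the function name, the sample glyphs, and the line tolerance are all made up for illustration, and real pipelines need to handle rotated and mirrored text matrices on top of this.

```python
# Hypothetical sketch: rebuilding reading order from positioned glyphs.
# Input: (x, y, char) tuples in PDF user space (y increases upward).

def glyphs_to_text(glyphs, line_tol=2.0):
    """Group glyphs into lines by y proximity, then read left-to-right."""
    # Sort by descending y (top of page first), then by x.
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))
    lines, current, last_y = [], [], None
    for x, y, ch in ordered:
        # A jump in y beyond the tolerance starts a new line.
        if last_y is not None and abs(y - last_y) > line_tol:
            lines.append(current)
            current = []
        current.append((x, ch))
        last_y = y
    if current:
        lines.append(current)
    # Within each line, order glyphs by x and join into a string.
    return "\n".join(
        "".join(ch for _, ch in sorted(line)) for line in lines
    )

# Made-up glyphs for two short lines of text.
glyphs = [
    (10, 700, "H"), (18, 700, "i"),                   # top line
    (10, 686, "y"), (18, 686, "o"), (26, 686, "u"),   # line below
]
print(glyphs_to_text(glyphs))  # → "Hi" then "you" on the next line
```

A fixed y tolerance like this already breaks on superscripts, multi-column layouts, and the flipped pages mentioned above, which is exactly why this step eats so much of the pipeline.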
Jesus Christ. What other approaches did you try?