I know OCR is easier to set up, but you lose a lot going that way.

We process several million pages from newspapers and magazines from all over the world, with layouts ranging from medium to very high complexity.

We built the PDF parser on top of open source PDF libraries, and this gives many advantages:

• We can accurately get headlines and other text placed on top of images. OCR is generally hopeless with text placed on top of images or on complex backgrounds.
• We distinguish letters accurately (e.g. the digit 1 vs. I vs. l, or "o" vs. zero).
• OCR will pick up ghost letters from images, where the OCR program believes there is text even though there isn't. We don't.
• We have much higher accuracy than OCR because we don't depend on the OCR program's ability to recognize the letters.
• We can utilize font information and accurate color information, which helps us distinguish elements from each other.
• We have accurate bounding box locations of each letter, word, line, and block (in points) -- see the sketch below.
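To make that concrete, here is a minimal sketch (not our actual parser; the file path is a placeholder) of reading per-letter text, bounding boxes in points, and font names straight from a PDF with the open source pdfminer.six library:

```python
# Minimal sketch: pull each glyph, its bounding box (points), and its font
# name directly from the PDF content, with no recognition step involved.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar

def iter_chars(layout_obj):
    """Recursively yield every LTChar in a pdfminer layout tree."""
    if isinstance(layout_obj, LTChar):
        yield layout_obj
        return
    if hasattr(layout_obj, "__iter__"):
        for child in layout_obj:
            yield from iter_chars(child)

for page in extract_pages("example.pdf"):   # placeholder path
    for ch in iter_chars(page):
        # Exact glyph, bbox in points, font name, and size -- so "1" vs "l"
        # vs "I" is never ambiguous. (Fill/stroke color is also exposed via
        # the character's graphics state.)
        print(ch.get_text(), ch.bbox, ch.fontname, round(ch.size, 1))
```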

To do this, we completely abandon the PDF text structure and only use the individual location of each letter. Then we combine letter positions into words, words into lines, and lines into text blocks using a number of algorithms.
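To give a flavor of the letters-to-words-to-lines idea, here is a simplified sketch -- not our production code, the thresholds are ad hoc, and it assumes plain horizontal text:

```python
# Toy grouping: bucket characters into lines by vertical position, then split
# each line into words wherever the horizontal gap is unusually large.
def group_chars(chars, line_tol=2.0, word_gap=0.3):
    """chars: list of (text, x0, y0, x1, y1) tuples in PDF points."""
    # 1. Bucket characters into lines (PDF y grows upward, so sort top-down).
    lines = []
    for ch in sorted(chars, key=lambda c: (-c[2], c[1])):
        for line in lines:
            if abs(line[-1][2] - ch[2]) <= line_tol:   # roughly the same baseline
                line.append(ch)
                break
        else:
            lines.append([ch])

    # 2. Split each line into words where the gap between neighbouring letters
    #    is large relative to the average letter width.
    result = []
    for line in lines:
        line.sort(key=lambda c: c[1])
        avg_w = sum(c[3] - c[1] for c in line) / len(line)
        words, current = [], [line[0]]
        for prev, ch in zip(line, line[1:]):
            if ch[1] - prev[3] > word_gap * avg_w:
                words.append("".join(c[0] for c in current))
                current = []
            current.append(ch)
        words.append("".join(c[0] for c in current))
        result.append(words)
    return result   # list of lines, each a list of word strings

# e.g. group_chars([("H", 10, 700, 16, 710), ("i", 16.5, 700, 18, 710),
#                   ("t", 30, 700, 35, 710)]) -> [["Hi", "t"]]
```

Grouping lines into blocks works on the same principle, just using vertical spacing and column alignment instead of horizontal gaps.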

Afterwards, we feed the structure blocks we generated into machine learning, so this is just the first step in analyzing the page.
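As a purely illustrative sketch of what that downstream step can look like (assumed feature names and labels, not our actual models), the blocks can be turned into simple geometric and typographic features for a classifier:

```python
# Illustrative only: classify text blocks (e.g. headline / body / caption)
# from features that the PDF parsing step already provides.
from sklearn.ensemble import RandomForestClassifier

def block_features(block):
    x0, y0, x1, y1 = block["bbox"]      # block bounding box in points
    return [
        x1 - x0,                        # width
        y1 - y0,                        # height
        block["avg_font_size"],         # from the per-letter font data
        block["num_words"],
        int(block["all_caps"]),
    ]

clf = RandomForestClassifier(n_estimators=100)
# annotated_blocks / labels would come from manually labelled pages:
# clf.fit([block_features(b) for b in annotated_blocks], labels)
# predictions = clf.predict([block_features(b) for b in new_page_blocks])
```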

It may seem like a large undertaking, but it literally only took a few months to build this initially, and we have very rarely touched the code over the last 10 years. So it was a very good investment for us.

Obviously, you can achieve a lot of the same with OCR -- but you lose information, accuracy, and computational efficiency, and you depend on the OCR program you use. The best OCR programs are commercial and somewhat pricey at scale.

> To do this, we completely abandon the PDF text structure and only use the individual location of each letter. Then we combine letter positions into words, words into lines, and lines into text blocks using a number of algorithms. Afterwards, we feed the structure blocks we generated into machine learning, so this is just the first step in analyzing the page.

Do you happen to have any sources for learning more about the piecing-together process? E.g. the overall process and the algorithms involved. It sounds like an interesting problem to solve.

We were 99.99% accurate with our OCR method. It's not just vanilla OCR: we also do a couple of metadata extractions (including the XML from the forms) and build a Textract-like JSON of the document so we can perform OCR on the right parts.
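As a rough sketch of what "OCR on the right parts" can look like (illustrative only, not the actual pipeline; it assumes a Textract-style layout JSON and a rendered page image):

```python
# Illustrative only: use the layout JSON to crop regions of interest from the
# page image, then run a targeted OCR pass on just those crops.
from PIL import Image
import pytesseract

def ocr_regions(page_image_path, layout_json, block_type="LINE"):
    img = Image.open(page_image_path)
    w, h = img.size
    texts = []
    for block in layout_json.get("Blocks", []):
        if block.get("BlockType") != block_type:
            continue
        bb = block["Geometry"]["BoundingBox"]   # normalized 0..1 coordinates
        box = (int(bb["Left"] * w), int(bb["Top"] * h),
               int((bb["Left"] + bb["Width"]) * w),
               int((bb["Top"] + bb["Height"]) * h))
        texts.append(pytesseract.image_to_string(img.crop(box)))
    return texts
```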

A lot has changed in 10 years. This was for a major financial institution and it worked great.

Do you have your parser released as a service? Curious to test it out.