I do PDF for a living, millions of PDFs per month, this is complete nonsense. There is no way you get better results from rastering and OCR than rendering into XML or other structured data.
How many different PDF generators have done those millions of PDFs tho?
Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application.
If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields together make up the invoice number (and in the order 1 4 2 3), maybe you're wrong.
Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor.
We process invoices from around the world, so more PDF generators than I care to count. It is a hard problem for sure, but the problem is the rendering; you can't escape that by rasterizing it, because that is rendering too.
So it is absurd to pretend you can solve the rendering problem by rendering into an image instead of a structured format. By rendering into a raster, you now have three problems: parsing the PDF, rendering a quality raster, and then OCR'ing the raster. It is mind-numbingly absurd.
Rendering is a different problem from understanding what's rendered.
If your PDF renders a part of the sentence at the beginning of the document, a part in the middle, and a part at the end, split between multiple sections, it's still rather trivial to render.
To parse and understand that this is the same sentence? A completely different matter.
Computers "don't understand" things. They process things, and what you're saying is called layoutinng which is a key part of PDF rendering. I do understand for someone unfamiliar with the internals of file formats, parsing, text shapping, and rendering in general, it all might seem like a blackmagic.
No one said it was black magic. In the context of OCR and parsing PDFs to convert them to structured data and/or text, rendering is a completely different task from text extraction.
As people have pointed out many times in the discussion: https://news.ycombinator.com/item?id=44783004, https://news.ycombinator.com/item?id=44782930, https://news.ycombinator.com/item?id=44789733 etc.
You're wrong. There is nothing inherent in "rendering" that means "raster or pixels". You can render PDFs or any format into any format you want, including XML for example.
In fact, in the majority of PDFs, a large part of rendering has to do with composing text.
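To make that concrete, here is a minimal sketch of what "rendering into XML" can look like, assuming pdfminer.six as the library and a hypothetical invoice.pdf (illustrative only, not anyone's production pipeline): walk the layout tree and emit text blocks with their bounding boxes instead of pixels.

```python
# Minimal sketch: "render" a PDF into XML-like structured data rather than
# pixels, using pdfminer.six (one library of many; purely illustrative).
from xml.sax.saxutils import escape

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


def pdf_to_xmlish(path: str) -> str:
    out = ["<document>"]
    for page_no, page in enumerate(extract_pages(path), start=1):
        out.append(f'  <page number="{page_no}">')
        for element in page:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # PDF points, origin bottom-left
                text = escape(element.get_text().strip())
                out.append(
                    f'    <block x0="{x0:.1f}" y0="{y0:.1f}" '
                    f'x1="{x1:.1f}" y1="{y1:.1f}">{text}</block>'
                )
        out.append("  </page>")
    out.append("</document>")
    return "\n".join(out)


if __name__ == "__main__":
    print(pdf_to_xmlish("invoice.pdf"))  # hypothetical input file
```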
You are using the Mark 1 eyeball for each new type of invoice to figure out what field goes where, right?
It is a bit more involved: we have a rule engine that is fine-tuned over time and works on most invoices. There is also an experimental AI-based engine that we are running in parallel, but the rule-based engine still wins on old invoices.
I sort of agree... I do the same.
We also parse millions of PDFs per month in all kinds of languages (both Western and Asian alphabets).
Getting the basics of PDF parsing to work is really not that complicated -- a few months' work. And it is an order of magnitude more efficient than generating an image at 300-600 DPI and doing OCR or running a visual LLM.
But some of the challenges (which we have solved) are:
• Glyph-to-Unicode tables are often limited or incorrect
• "Boxing" blocks of text into "paragraphs" can be tricky
• Handling extra spaces and missing spaces between letters and words. Often PDFs do not include the spaces, or they are incorrect, so you need to identify the gaps yourself (see the sketch after this list).
• Often graphic designers of magazines/newspapers will hide text behind e.g. a simple white rectangle and place a new version of the text above it. So you need to keep track of z-order and ignore hidden text.
• Common text can be embedded as vector paths -- not just logos, we also see it with ordinary text. So you need a way to handle that.
• Drop caps and similar "artistic" choices can be a bit painful
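For the missing-spaces point, here is a minimal sketch of the gap heuristic, with a made-up threshold and a hypothetical Glyph type; real code also has to deal with font metrics, kerning, and justified text.

```python
# Sketch: infer missing spaces from horizontal gaps between glyphs on a line.
# The 0.25 threshold is an illustrative guess, not a fixed rule.
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x0: float    # left edge in points
    x1: float    # right edge in points
    size: float  # font size in points


def join_with_inferred_spaces(glyphs: list[Glyph], gap_ratio: float = 0.25) -> str:
    parts: list[str] = []
    prev = None
    for g in glyphs:  # assumed already sorted left-to-right on one line
        if prev is not None and (g.x0 - prev.x1) > gap_ratio * prev.size:
            parts.append(" ")
        parts.append(g.text)
        prev = g
    return "".join(parts)
```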
There are a lot of other smaller issues -- but they are generally edge cases.
OCR handles some of these issues for you. But we found that OCR often misidentifies letters (all major OCR engines do), and they are certainly not perfect with spaces either. So if you are going for quality, you can get better results by parsing the PDFs.
Visual transformers are not good with accurate coordinates/boxing yet -- at least we haven't seen a good enough implementation, even though it is getting better.
We tried the xml structured route, only to end up with pea soup afterwards. Rasterizing and OCR was the only way to get standardized output.
I know OCR is easier to set up, but you lose a lot going that way.
We process several million pages from Newspapers and Magazines from all over the world with medium to very high complexity layouts.
We built the PDF parser on top of open source PDF libraries, and this gives many advantages:
• We can accurately get headlines and other text placed on top of images. OCR is generally hopeless with text placed on top of images or on complex backgrounds.
• We can distinguish letters accurately (e.g. the digit 1 vs. I vs. l, or "o" vs. zero).
• OCR will pick up ghost letters from images, where the OCR program believes there is text even if there isn't. We don't.
• We have much higher accuracy than OCR because we don't depend on the OCR program's ability to recognize the letters.
• We can utilize font information and accurate color information, which helps us distinguish elements from each other.
• We have accurate bounding box locations (in pts) of each letter, word, line, and block.
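A minimal sketch of that kind of per-glyph extraction, using pdfminer.six purely as an example (the comment above only says "open source PDF libraries", so the choice of library is an assumption): collect every letter with its bounding box in points and its font name, which is the raw material for the grouping described below.

```python
# Sketch: harvest per-glyph bounding boxes and font names with pdfminer.six.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine


def glyphs_with_metadata(path: str):
    """Yield one dict per glyph: character, bbox in pts, font name, size."""
    for page in extract_pages(path):
        for block in page:
            if not isinstance(block, LTTextContainer):
                continue
            for line in block:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield {
                            "char": obj.get_text(),
                            "bbox": obj.bbox,      # (x0, y0, x1, y1) in pts
                            "font": obj.fontname,  # e.g. "ABCDEF+SomeFont-Bold"
                            "size": obj.size,
                        }
```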
To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms.
We use the structure blocks that we generated with machine learning afterwards, so this is just the first step in analyzing the page.
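A rough sketch of what that grouping can look like, with hypothetical tolerance values (the actual algorithms are not described, so this is only the general idea): cluster glyphs into lines by vertical proximity, sort each line left-to-right, then split lines into words on large horizontal gaps.

```python
# Sketch: group individual letter positions into lines and words.
# Real implementations also need rotation, multi-column layouts, RTL text, etc.
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(glyphs: list[Glyph], y_tol: float = 2.0) -> list[list[Glyph]]:
    """Cluster glyphs whose vertical positions are within y_tol points."""
    lines: list[list[Glyph]] = []
    for g in sorted(glyphs, key=lambda c: (-c.y0, c.x0)):  # top-to-bottom
        for line in lines:
            if abs(line[0].y0 - g.y0) <= y_tol:  # roughly the same baseline
                line.append(g)
                break
        else:
            lines.append([g])
    return [sorted(line, key=lambda c: c.x0) for line in lines]


def line_to_words(line: list[Glyph], gap_factor: float = 0.4) -> list[str]:
    """Split a left-to-right sorted line into words on large horizontal gaps."""
    avg_width = sum(g.x1 - g.x0 for g in line) / len(line)
    words, current = [], [line[0]]
    for prev, g in zip(line, line[1:]):
        if g.x0 - prev.x1 > gap_factor * avg_width:  # big gap => word break
            words.append("".join(c.char for c in current))
            current = []
        current.append(g)
    words.append("".join(c.char for c in current))
    return words
```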
It may seem like a large undertaking, but it literally only took a few months to build this initially, and we have very rarely touched the code over the last 10 years. So it was a very good investment for us.
Obviously, you can achieve a lot of the same with OCR -- But you lose information, accuracy, and computational efficiency. And you depend on the OCR program you use. Best OCR programs are commercial and somewhat pricy at scale.
> To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms. We use the structure blocks that we generated with machine learning afterwards, so this is just the first step in analyzing the page.
Do you happen to have any sources for learning more about the piecing-together process? E.g. the overall process and the algorithms involved, etc. It sounds like an interesting problem to solve.
We were 99.99% accurate with our OCR method. It's not just vanilla OCR but a couple of extractions of metadata (including the XML from the forms) and a Textract-like JSON of the document to perform OCR on the right parts.
A lot has changed in 10 years. This was for a major financial institution and it worked great.
Do you have your parser released as a service? Curious to test it out.