How many different PDF generators have done those millions of PDFs tho?
Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application.
If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields are the invoice number together (and in order 1 4 2 3), maybe you're wrong.
Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor.
We process invoices from around the world, so more PDF generators than I care to count. It is hard a problem for sure, but the problem is the rendering, you can't escape that by rastering it, that is rendering.
So it is absurd to pretend you can solve the rendering problem by rendering it into an image instead of a structured format. By rendering it into a raster, now you have 3 problems, parsing the PDF, rendering quality raster, then OCR'ing the raster. It is mind numbingly absurd.
Rendering is a different problem from understanding what's rendered.
If your PDF renders a part of the sentence at the beginning of the document, a part in the middle, and a part at the end, split between multiple sections, it's still rather trivial to render.
To parse and understand that this is the same sentence? A completely different matter.
Computers "don't understand" things. They process things, and what you're saying is called layoutinng which is a key part of PDF rendering. I do understand for someone unfamiliar with the internals of file formats, parsing, text shapping, and rendering in general, it all might seem like a blackmagic.
No one said it was as black magic. In the context of OCR and parsing PDFs to convert them to structured data and/or text, rendering is a completely different task from text extraction.
As people have pointed out many times in the discussion: https://news.ycombinator.com/item?id=44783004, https://news.ycombinator.com/item?id=44782930, https://news.ycombinator.com/item?id=44789733 etc.
You're wrong. There is nothing inherent in "rendering" that means "raster or pixels". You can render PDFs or any format into any format you want, including XML for example.
In fact, in majority of PDFs, a large part of rendering has to do with composing text.
You are using the Mark 1 eyeball for each new type of invoice to figure out what field goes where, right?
It is a bit more involved, we have a rule engine that is fine tuned over time and works on most of invoices, there is also an experimental AI based engine that we are running in parallel but the rule based Engine still wins on old invoices.