So you've outsourced the parsing to whatever software you're using to render the PDF as an image.

Seems like a fairly reasonable decision given all the high quality implementations out there.

How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out? Sounds like "I don't know programming, so I will just use AI".

> How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out?

Because PDFs might not have the data in a structured form; how would you get the structured data out of an image in the PDF?

Sir, some of our cars break down every now and then and we have to push them. Since this happens every so often and we want to avoid it, we have implemented a policy of pushing all cars at all times instead of driving them. This removes the problem of occasionally having to push a car.

> instead of just using the "quality implementation" to actually get structured data out?

I suggest spending a few minutes using a PDF editor program with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute tricks and hacks that whatever produced them used to make something that visually works. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.

EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.

We do a lot of parsing of PDFs and basically break the structure into 'letter with font at position (box)' because the "structure" within the PDF is unreliable.

We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at the geometry. Obviously we also identify the spaces between words.
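
A minimal sketch of that geometric step in Python, with made-up glyph boxes and a hard-coded gap threshold (real code would scale the threshold by font size and handle multiple baselines):

  # Made-up glyph boxes (x0, x1, char) on a single baseline, roughly what a
  # PDF parser emits. A horizontal gap wider than the threshold is a word break.
  glyphs = [
      (10, 16, "H"), (17, 22, "e"), (23, 26, "l"), (27, 30, "l"), (31, 37, "o"),
      (45, 52, "w"), (53, 59, "o"), (60, 64, "r"), (65, 68, "l"), (69, 74, "d"),
  ]

  GAP = 4  # points; in practice this is derived from the font size
  words, current = [], [glyphs[0][2]]
  for (_, prev_x1, _), (x0, _, ch) in zip(glyphs, glyphs[1:]):
      if x0 - prev_x1 > GAP:
          words.append("".join(current))
          current = [ch]
      else:
          current.append(ch)
  words.append("".join(current))

  print(words)  # -> ['Hello', 'world']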

We handle hidden text and problematic glyph-to-unicode tables.

The output is similar to OCR's, except we don't do the rasterization, and quality is higher because we don't depend on vision-based text recognition.

I built the base implementation of all this in less than a month, 10 years ago, and we rarely, if ever, touch it.

We also do machine learning on the structured output afterwards.

Very interesting. How often do you encounter PDFs that are just scanned pages? I had to make heavy use of pdfsandwich last time I was accessing journal articles.

> quality is higher because we don't depend on vision based text recognition

This surprises me a bit; apart from an actual scan (where the document leaves the computer), I’d expect PDF->image->text done entirely in software to be essentially lossless.

This happens -- and we also see variants that have already been processed with OCR.

So if it is just scanned, it contains a single image and no text.

OCR programs will commonly create a PDF where the text/background and detected images are separate. The OCR program then inserts transparent (no-draw) letters in place of the text it has identified, or (less frequently) places the letters behind the scanned image in the PDF (i.e. at a lower z-order).
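
For what it's worth, that hidden layer is usually written with text rendering mode 3 (invisible text), so you can spot it by scanning the page content streams. A hedged sketch, assuming pikepdf's parse_content_stream helper behaves as in its documented examples and a placeholder filename:

  import pikepdf

  # Look for the "3 Tr" operator (text rendering mode 3 = invisible), which
  # OCR tools typically use for the hidden text layer they add over the scan.
  with pikepdf.open("scanned.pdf") as pdf:
      for page_no, page in enumerate(pdf.pages, start=1):
          for operands, operator in pikepdf.parse_content_stream(page):
              if operator == pikepdf.Operator("Tr") and int(operands[0]) == 3:
                  print(f"page {page_no}: has an invisible (OCR-style) text layer")
                  break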

We can detect whether something has been generated by an OCR program by looking at the "Creator" data in the PDF that describes the program used to create it. So we can handle that differently (and we do handle it a little differently).
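
A minimal sketch of that kind of check, assuming pypdf and an illustrative (not exhaustive) list of OCR producer strings, plus a placeholder filename:

  from pypdf import PdfReader

  OCR_HINTS = ("ABBYY", "Tesseract", "OCRmyPDF", "Paper Capture")  # illustrative only

  reader = PdfReader("input.pdf")
  info = reader.metadata
  creator = (info.creator or "") if info else ""
  producer = (info.producer or "") if info else ""

  if any(h.lower() in f"{creator} {producer}".lower() for h in OCR_HINTS):
      print("Looks OCR-generated; treat the text layer with extra suspicion.")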

PDF->image->text is 100% not lossless.

When you rasterize the PDF, you lose information, because you are going from a resolution-independent format to a specific resolution:

• Text must be rasterized into letter shapes at the target resolution
• Images must be resampled to the target resolution
• Vector paths must be rasterized at the target resolution

So for example the target resolution must be high enough that small text is legible.
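
The arithmetic is simple: PDF coordinates are in points (72 per inch), so a glyph's rasterized height is point_size / 72 * dpi. A quick illustration for 6 pt text:

  # Pixel height of 6 pt text at various rasterization resolutions.
  for dpi in (72, 150, 300, 600):
      print(f"{dpi} dpi -> {6 / 72 * dpi:.0f} px tall")
  # 72 dpi -> 6 px, 150 dpi -> 12 px, 300 dpi -> 25 px, 600 dpi -> 50 px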

If you perform OCR, you depend on the ability of the OCR program to accurately identify the letters based on the rasterized form.

OCR is not 100% accurate, because it is a computer-vision recognition problem, and:

• There are hundreds of thousands of fonts in the wild, each with different details and appearance.
• Two letters can look the same; a simple example where trivial OCR/recognition fails is capital "I" and lower-case "l". These are both vertical lines, so you need the context of the letters nearby (see the sketch below). Same with "O" and zero.
• OCR is also pretty hopeless with e.g. headlines/text written on top of images, because it is hard to distinguish the letters from the background. But even regular black-on-white text fails sometimes.
• OCR will also commonly identify "ghost" letters in images that are not really there, i.e. pick up a bunch of pixels that get detected as a letter but are really just some pixel structure that is part of the image (not even necessarily text in the image) -- a form of hallucination.
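
A toy illustration of the context point from the second bullet. The cleanup rules are invented for the example; production systems use confusion matrices or language models rather than two regexes:

  import re

  def fix_common_confusions(text: str) -> str:
      # 'l' surrounded by capitals is almost certainly 'I';
      # 'O' surrounded by digits is almost certainly zero.
      text = re.sub(r"(?<=[A-Z])l(?=[A-Z])", "I", text)
      text = re.sub(r"(?<=\d)O(?=\d)", "0", text)
      return text

  print(fix_common_confusions("INVOlCE NO. 2O25"))  # -> "INVOICE NO. 2025"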

As someone who had to parse form data from a pdf, where the pdf author named the inputs TextField1 TextField2 TextFueld3 etc.

Misspellings, default names, a mixture, home brew naming schemes, meticulous schemes, I’ve seen it all. It’s definitely easier to just rasterize it and OCR it.

Same. Then someone edits the form and changes the names of several inputs, obsoleting much of the previous work, some of which still needs to be maintained because multiple versions are floating around.

I do PDF for a living, millions of PDFs per month, this is complete nonsense. There is no way you get better results from rastering and OCR than rendering into XML or other structured data.

How many different PDF generators have done those millions of PDFs tho?

Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application.

If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields are the invoice number together (and in order 1 4 2 3), maybe you're wrong.

Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor.

We process invoices from around the world, so more PDF generators than I care to count. It is a hard problem for sure, but the problem is the rendering; you can't escape that by rastering it, because that is rendering too.

So it is absurd to pretend you can solve the rendering problem by rendering into an image instead of a structured format. By rendering into a raster, you now have three problems: parsing the PDF, rendering a quality raster, and then OCR'ing the raster. It is mind-numbingly absurd.

Rendering is a different problem from understanding what's rendered.

If your PDF renders a part of the sentence at the beginning of the document, a part in the middle, and a part at the end, split between multiple sections, it's still rather trivial to render.

To parse and understand that this is the same sentence? A completely different matter.

Computers "don't understand" things. They process things, and what you're saying is called layoutinng which is a key part of PDF rendering. I do understand for someone unfamiliar with the internals of file formats, parsing, text shapping, and rendering in general, it all might seem like a blackmagic.

No one said it was black magic. In the context of OCR and parsing PDFs to convert them to structured data and/or text, rendering is a completely different task from text extraction.

As people have pointed out many times in the discussion: https://news.ycombinator.com/item?id=44783004, https://news.ycombinator.com/item?id=44782930, https://news.ycombinator.com/item?id=44789733 etc.

You're wrong. There is nothing inherent in "rendering" that means "raster or pixels". You can render PDFs or any format into any format you want, including XML for example.

In fact, in the majority of PDFs, a large part of rendering has to do with composing text.

You are using the Mark 1 eyeball for each new type of invoice to figure out what field goes where, right?

It is a bit more involved: we have a rule engine that has been fine-tuned over time and works on most invoices. There is also an experimental AI-based engine that we run in parallel, but the rule-based engine still wins on old invoices.

I sort of agree... I do the same.

We also parse millions of PDFs per month, in all kinds of languages (both Western and Asian scripts).

Getting the basics of PDF parsing to work is really not that complicated -- a few months' work. And it is an order of magnitude more efficient than generating an image at 300-600 DPI and running OCR or a visual LLM on it.

But some of the challenges (which we have solved) are:

• Glyph-to-Unicode tables are often limited or incorrect (see the sketch after this list)
• "Boxing" blocks of text into "paragraphs" can be tricky
• Handling extra spaces and missing spaces between letters and words: often PDFs do not include the spaces, or they are incorrect, so you need to identify the gaps yourself
• Often graphic designers of magazines/newspapers will hide text behind e.g. a simple white rectangle and place a new version of the text on top, so you need to keep track of z-order and ignore hidden text
• Text can be embedded as vector paths -- not just logos; we also see it with ordinary text -- so you need a way to handle that
• Drop caps and similar "artistic" choices can be a bit painful
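
To make the first bullet concrete, here is a toy sketch of the glyph-to-Unicode fallback idea. The CMap and fallback tables are invented for the example; real code falls back to the font's encoding, glyph names, or even glyph-shape matching:

  # An incomplete ToUnicode CMap plus a fallback table recovered elsewhere.
  to_unicode = {12: "H", 15: "e", 19: "l", 22: "o"}
  fallback   = {31: "w", 35: "r", 38: "d"}

  glyph_codes = [12, 15, 19, 19, 22, 31, 22, 35, 19, 38]

  def decode(code):
      return to_unicode.get(code) or fallback.get(code) or "\ufffd"

  print("".join(decode(c) for c in glyph_codes))  # -> "Helloworld"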

There are a lot of other, smaller issues -- but they are generally edge cases.

OCR handles some of these issues for you. But we found that OCR often misidentifies letters (all the major OCR engines do), and they are certainly not perfect with spaces either. So if you are going for quality, you get better results by parsing the PDFs.

Vision transformers are not good at accurate coordinates/boxing yet -- at least we haven't seen a good enough implementation, even though it is getting better.

We tried the xml structured route, only to end up with pea soup afterwards. Rasterizing and OCR was the only way to get standardized output.

I know OCR is easier to set up, but you lose a lot going that way.

We process several million pages from Newspapers and Magazines from all over the world with medium to very high complexity layouts.

We built the PDF parser on top of open-source PDF libraries, and this gives many advantages:

• We can accurately get headlines and other text placed on top of images. OCR is generally hopeless with text placed on top of images or on complex backgrounds.
• We distinguish letters accurately (e.g. the digit 1, "I", "l", "o", zero).
• OCR will pick up ghost letters from images where the OCR program believes there is text, even if there isn't. We don't.
• We have much higher accuracy than OCR because we don't depend on an OCR program's ability to recognize the letters.
• We can utilize font information and accurate color information, which helps us distinguish elements from each other.
• We have accurate bounding-box locations (in pts) for each letter, word, line, and block.

To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms.
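
A minimal sketch of the words-to-lines-to-blocks part, with made-up word boxes (x0, y0, x1, y1, text) in PDF points; the tolerances here are hard-coded, whereas real code would scale them by font size and also handle rotation, columns, and overlapping boxes:

  words = [
      (72, 700, 118, 712, "Millions"), (122, 700, 134, 712, "of"),
      (138, 700, 172, 712, "PDFs"),
      (72, 684, 104, 696, "every"), (108, 684, 148, 696, "month"),
      (72, 640, 122, 652, "Another"), (126, 640, 158, 652, "block"),
  ]

  # 1. Group words into lines: same baseline (y0 within a small tolerance).
  words.sort(key=lambda w: (-w[1], w[0]))         # top-to-bottom, left-to-right
  lines = []
  for w in words:
      if lines and abs(lines[-1][-1][1] - w[1]) < 3:
          lines[-1].append(w)
      else:
          lines.append([w])

  # 2. Group lines into blocks: consecutive lines with a small vertical gap.
  blocks = [[lines[0]]]
  for prev, cur in zip(lines, lines[1:]):
      if prev[0][1] - cur[0][1] < 20:
          blocks[-1].append(cur)
      else:
          blocks.append([cur])

  for i, block in enumerate(blocks):
      print(f"block {i}:")
      for line in block:
          print("  " + " ".join(w[4] for w in line))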

Afterwards, we run machine learning on the structured blocks we generated, so this is just the first step in analyzing the page.

It may seem like a large undertaking, but it literally only took a few months to build this initially, and we have very rarely touched the code over the last 10 years. So it was a very good investment for us.

Obviously, you can achieve a lot of the same with OCR -- but you lose information, accuracy, and computational efficiency. And you depend on the OCR program you use. The best OCR programs are commercial and somewhat pricey at scale.

> To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms. Afterwards, we run machine learning on the structured blocks we generated, so this is just the first step in analyzing the page.

Do you happen to have any sources for learning more about the piecing-together process? E.g. the overall process, the algorithms involved, etc. It sounds like an interesting problem to solve.

We were 99.99% accurate with our OCR method. It’s not just vanilla OCR but a couple of extractions of metadata (including the XML from the forms) and a Textract-like JSON of the document, used to perform OCR on the right parts.

A lot has changed in 10 years. This was for a major financial institution and it worked great.

Do you have your parser released as a service? Curious to test it out.

PDFs don't always lay out characters in sequence, sometimes they have absolutely positioned individual characters instead.

PDFs don't always use UTF-8, sometimes they assign random-seeming numbers to individual glyphs (this is common if unused glyphs are stripped from an embedded font, for example)

etc etc

But all those problems exist when rendering into a surface or raster too. I just don't understand how one thinks: this is a hard problem, so let me make it harder by turning it into another kind of problem that is just as hard as solving it in the first place (PDF to structured data vs. PDF to raster), and then solve that new problem, which is also hard. It is absurd.

The problems don't actually exist in the way you think.

When extracting text directly, the goal is to put it back into content order, regardless of stream order. Then turn that into a string. As fast as possible.

That's straight text. If you want layout info, it does more. But it's also not just processing it as a straight stream and rasterizing the result; it's trying to avoid doing that work.

This is non-trivial on lots of pdfs, and a source of lots of parsing issues/errors because it's not just processing it all and rasterizing it, but trying to avoid doing that.

When rasterizing, you don't care about any of this at all. PDFs were made to raster easily. It does not matter what order the text is in the file, or where the tables are, because if you parse it straight through, raster, and splat it to the screen, it will be in the proper display order and look right.

So if you splat it onto the screen, and then extract it, it will be in the proper content/display order for you. Same is true of the tables, etc.

So the direct extraction problems don't exist if you can parse the screen into whatever you want, with 100% accuracy (and of course it doesn't matter if you use AI or not to do it).

Now, I am not sure I would use this method anyway, but your claim that the same problems exist is definitely wrong.

I don’t think people are suggesting: build a renderer > build an OCR pipeline > run it on PDFs.

I think people are suggesting: use a ready-made renderer > use ready-made OCR pipelines/APIs > run it on PDFs.

A colleague uses a document scanner to create a pdf of a document and sends it to you

You must return the data represented in it retaining as much structure as possible

How would you proceed? Return just the metadata of when the scan was made and how?

Genuinely wondering

You can use an existing readymade renderer to render into structured data instead of raster.

Just to illustrate this point, poppler [1] (the most popular open-source PDF renderer) has a little tool called pdftocairo [2] which can render a PDF into an SVG. This means you can delegate all PDF rendering to poppler and only work with actual graphical objects to extract semantics.

I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically. A sentence can be broken down into words or letters. Table lines can be formed from multiple smaller lines, etc. But, as mentioned by the parent, rule-based systems work reasonably well for reasonably focused problems. You will never have a general-purpose extractor, though, since the rules need to be written by humans.

[1] https://poppler.freedesktop.org/ [2] https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...
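
As a hedged starting point (placeholder filenames; exactly how text ends up encoded in the SVG -- <text> elements, glyph references, or outlined paths -- depends on the poppler/cairo version, so treat this as exploration code rather than a finished extractor):

  import subprocess
  import xml.etree.ElementTree as ET

  # Let poppler render page 1 to SVG, then walk the resulting graphical objects.
  subprocess.run(["pdftocairo", "-svg", "-f", "1", "-l", "1",
                  "input.pdf", "page1.svg"], check=True)

  counts = {}
  for elem in ET.parse("page1.svg").iter():
      tag = elem.tag.split("}")[-1]  # strip the XML namespace
      counts[tag] = counts.get(tag, 0) + 1

  print(counts)  # e.g. how many path/image/glyph-like elements the page contains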

There is also PDF to HTML, PDF to text, MuPDF also has PDF to XML, both projects along with a bucketful of other PDF toolkits have PDF to PS, and there are many, many XML, HTML, and text converters for PS.

Rastering and OCR'ing PDF is like using regex to parse XHTML. My eyes are starting to bleed out, I am done here.

It looks like you make a lot of valid points, but you also have an extremely visceral reaction because there's a company out there that's using AI in a way that offends you. I mean, fair, still.

But I'm a guy who's in the market for a PDF parser service, and I'm happy to pay a pretty penny per page processed. I just want a service that works without me thinking for a second about any of the problems you guys are all discussing. What service do I use? Do I care if it uses AI in the lamest way possible? The only thing that matters is the results. There are two people, including you, in this thread brimming with PDF-parsing wisdom, but from reading it all, it doesn't look like I can do things the right way without spending months fully immersed in this problem alone. If you or anyone has a non-blunt AI service that I can use, I'll be glad to check it out.

It is a hard problem, yes, but you don't solve it by rastering it, OCR'ing it, and then using AI. You render it into a structured format. Then at least you don't have to worry about hallucinations, OCR problems with fancy fonts, text-shaping problems, or the huge waste of GPU and CPU spent painting an image only to OCR it and throw it away.

Use a solution that renders PDF into structured data if you want correct and reliable data.

pdftotext from poppler gives you that without any juggling of formats.

Sometimes scanned documents are structured really weird, especially for tables. Visually, we can recognize the intention when it's rendered, and so can the AI, but you practically have to render it to recover the spatial context.

But why do you have to render it into a bitmap?

PDF to raster seems a lot easier than PDF to structured data, at least in terms of dealing with the odd edge cases. PDF is designed to raster consistently, and if someone generates something that doesn't raster in enough viewers, they'll fix it. PDF does not have anything that constrains generators to a sensible structured representation of the information in the document, and most people generating PDF documents are going to look at the output, not run it through a system to extract the structured data.

> How is it reasonable to render the PDF, rasterize it, OCR it, use AI, instead of just using the "quality implementation" to actually get structured data out?

Because the underlying "structured data" is never checked while the visual output is checked by dozens of people.

"Truth" is the stuff that the meatbags call "truth" as seen by their squishy ocular balls--what the computer sees doesn't matter.

Your first mistake is thinking that computers "see the image"; your second is thinking that the output of OCR is somehow different from that of a PDF engine that renders it into structured data/text.

There are many cases where images are exported as PDFs. Think invoices or financial statements that people send to financial services companies. Using layout understanding and OCR-based techniques leads to way better results than writing a parser which relies on the file's metadata.

The other thing is segmenting a document and linearizing it so that an LLM can understand the content better. Layout understanding helps with figuring out the natural reading order of various blocks of the page.

> There are many cases where images are exported as PDFs.

One client of a client would print out her documents, then "scan" them with an Android app (actually just a photograph wrapped in a PDF). She was taught that this application is the way to create PDF files, and would staunchly not be retrained. She came up with this print-then-photograph workflow after being told not to photograph the computer monitor -- that's the furthest retraining she was able to absorb.

Make no mistake, this woman was extremely successful in her field. Successful enough to be a client of my client. But she was taught that PDF equals that specific app, and wasn't going to change her workflow to accommodate others.

PDF is a list of drawing commands (not exactly, but a useful simplification). All those draw commands in some JS lib, or in SVG, or in every other platform's API? PDF or PostScript probably did them first. The model is: there is some canvas in which I define coordinate spaces, then issue commands to draw $thing at position $(x,y), scaled by $z.

You might think of your post as a <div>. Some kind of paragraph or box of text in which the text is laid out and styles applied. That's how HTML does it.

PDF doesn't necessarily work that way. Different lines, words, or letters can be in entirely different places in the document. Anything that resembles a separator, table, etc can also be anywhere in the document and might be output as a bunch of separate lines disconnected from both each other and the text. A renderer might output two-column text as it runs horizontally across the page so when you "parse" it by machine the text from both columns gets interleaved. Or it might output the columns separately.

You can see a user-visible side-effect of this when PDF text selection is done the straightforward way: sometimes you have no problem selecting text. In other documents selection seems to jump around or select abject nonsense unrelated to cursor position. That's because the underlying objects are not laid out in a display "flow" the way HTML does by default so selection is selecting the next object in the document rather than the next object by visual position.
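
To make the drawing-commands model concrete, here is a toy, hand-written (uncompressed) content stream in which one visual line is emitted in two out-of-order pieces, plus the sort-by-position trick that recovers reading order. Real streams are Flate-compressed and far messier, so the regex here is purely illustrative:

  import re

  content = b"""
  BT /F1 12 Tf 220 700 Td (world) Tj ET
  BT /F1 12 Tf 100 700 Td (Hello) Tj ET
  """

  # Stream order gives "world Hello"; sorting by the Td coordinates
  # (top-to-bottom, then left-to-right) recovers the visual order.
  runs = []
  for m in re.finditer(rb"([\d.]+) ([\d.]+) Td \((.*?)\) Tj", content):
      x, y = float(m.group(1)), float(m.group(2))
      runs.append((x, y, m.group(3).decode("latin-1")))

  runs.sort(key=lambda r: (-r[1], r[0]))  # PDF y grows upward
  print(" ".join(text for _, _, text in runs))  # -> "Hello world"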

Because PDF is as much a vector graphics format as a document format, you cannot expect the data to be reasonably structured. For example, applications can convert text to vector outlines or bitmaps for practical or artistic purposes (anyone who ever had to deal with transparency "flattening" issues knows the pain); ideally they also encode the text in a separate semantic representation. But many times PDF files are exported from "image centric" programs with image-centric workflows (e.g. Illustrator, CorelDraw, InDesign, QuarkXPress, etc.) where the main issue being solved for is presentational content, not semantics.

For example, if I receive a Word document and need to lay it out so it fits into my multi-column magazine layout, I will take the source text and break it into separate sections which then manually get copied and pasted into InDesign. You can import the document directly, but for all kinds of practical reasons this is not the default way of working. Some asides and lists might be broken out of the original flow of text and placed in their own text field, etc. So now you have lost the original semantic structure.

Remember, this is how desktop publishing evolved: for print, which has no notion of structure or metadata embedded into the ink or paper. Another common use case is to simply have resolution-independent graphics -- again, for display purposes only; no structured data is required nor expected.

> Sounds like "I don't know programming, so I will just use AI".

If you were leading Tensorlake, running on early stage VC with only 10 employees (https://pitchbook.com/profiles/company/594250-75), you'd focus all your resources on shipping products quickly, iterating over unseen customer needs that could make the business skyrocket, and making your customers so happy that they tell everyone and buy lots more licenses.

Because you're a stellar tech leader and strategist, you wouldn't waste a penny reinventing low-level plumbing that's available off the shelf, either cheaply or as free OSS. You'd be thinking about the inevitable opportunity costs: if I build X then I can't build Y, simply because a tiny startup doesn't have enough resources to build X and Y. You'd quickly conclude that building a homegrown, robust PDF parser would be an open-ended tar pit that precludes you from focusing on making your customers happy and growing the business.

And the rest of us would watch in awe, seeing truly great tech leadership at work, making it all look easy.

I would hire someone who understands PDFs instead of doing the equivalent of printing a digital document and scanning it for "digital record keeping". Stop everything and hire someone who understands the basics of data processing and some PDF.

What's the economic justification?

Let's assume we have a staff of 10 and they're fully allocated to committed features and deadlines, so they can't be shifted elsewhere. You're the CTO and you ask the BOD for another $150k/y (fully burdened) + equity to hire a new developer with PDF skills.

The COB asks you directly: "You can get a battle-tested PDF parser off-the-shelf for little or no cost. We're not in the PDF parser business, and we know that building a robust PDF parser is an open-ended project, because real-world PDFs are so gross inside. Why are you asking for new money to build our own PDF parser? What's your economic argument?"

And the killer question comes next: "Why aren't you spending that $150k/y on building functionality that our customers need?" If you don't give a convincing business justification, you're shoved out the door, because as a CTO your job is building technology that satisfies the business objectives.

So CTO, what's your economic answer?

The mistake all of you are making is the assumption that PDF rendering means rasterization. Everything else crumbles down from that misconception.

So if you receive a PDF full of sections containing pre-rasterized text (e.g. adverts, 3D-rendered text with image effects, scanned documents, handwritten errata), what do you do? You cannot use OCR, because apparently only PDF-illiterate idiots would try such a thing?

I wouldn't start by rastering the rest of the PDF. In the business world, unlike academia and bootleg books and file sharing, the majority of PDFs are computer-generated. I know because I do this for a living.

I just spent a few weeks testing about 25 different pdf engines to parse files and extract text.

Only three of them can process all 2,500 files I tried (which are just machine manuals from major manufacturers, so not highly weird shit) without hitting errors, let alone producing correct results.

About 10 of them have a 5% or less failure rate on parsing the files (let alone extracting text). This is horrible.

It then goes very downhill.

I'm retired, so I have time to fuck around like this. But going into it, there is no way I would have expected these results, or had time to figure out which 3 libraries could actually be used.

I think it's reasonable because their models are probably trained on images, and not whatever "structured data" you may get out of a PDF.

Yes, this! We trained it on a ton of diverse document images to learn reading order and document layouts :)

But you have to render the PDF to get an image, right? How do you go from PDF to raster?

No model can do better on images than structured data. I am not sure if I am on crack or you're all talking nonsense.

You are assuming structure where there is none. It's not the crack, it's the lack of experience with PDF from diverse sources. Just for instance, I had a period where I was _regularly_ working with PDF files with the letters in reverse order, each letter laid out individually (not a single complete word in the file).

You're thinking "rendering structured data" means parsing PDF as text. That is just wrong. Carefully read what I said. You render the PDF, but into structured data rather than raster. If you still get letters in reverse when you render your PDF into structured data, your rendering engine is broken.

How do you render into structured data, from disparate letters that are not structured?

  D10
  E1
  H0
  L2,3,9
  O4,7
  R8
  W6
I'm sure that you could look at that and figure out how to structure it. But I highly doubt that you have a general-purpose computer program that can parse that into structured data, having never encountered such a format before. Yet, that is how many real-world PDF files are composed.

It is called rendering. MuPDF, Poppler, PDFjs, and so on. The problem is that you and everyone else thinks "rendering" means bitmaps. That is not how it works.

Then I would very much appreciate it if you would enlighten me. I'm serious, I would love nothing more than for you to prove your point, teach me something, and win an internet argument. Once rendered, do any of the rendering engines expose e.g. selectable or accessible text? Poppler didn't, and neither did some Java library that I tried.

For me, learning something new is very much worth losing the internet argument!

I have explained the details in other comments, have a look. But you can start by looking at pdftotext from Poppler: it is ready to go for 60-70% of cases with the -layout flag, and with -bbox-layout you get even more detail.
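
For reference, the two invocations being suggested, with placeholder filenames: -layout keeps the visual column layout in the plain-text output, and -bbox-layout adds per-block/line/word bounding boxes in an XHTML-style output.

  import subprocess

  subprocess.run(["pdftotext", "-layout", "input.pdf", "layout.txt"], check=True)
  subprocess.run(["pdftotext", "-bbox-layout", "input.pdf", "boxes.html"], check=True)

  with open("layout.txt", encoding="utf-8") as f:
      print(f.read()[:500])  # plain text with columns roughly preserved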

Thank you. Even with the bbox layout, one cannot even know whether there is a coherent word or phrase to extract without visually inspecting the PDF beforehand. I've been there, fighting with it right in the CLI and finding that there is no way to even progress to a script.

The advantage of the OCR method is that it effectively performs that visual inspection. That's why it is preferable for PDFs of disparate origin.

What kind of semantics can you infer from the text of OCRing a bitmap that you can't infer from the text generated directly from the PDF? Is it the lack of OCR mistakes? The hallucinations? Or something else?

In the cases that I've seen, the PDF software does not generate text strings. It generates individual letters. It is up to any application to try to figure out how those individual letters relate to one another.

Did you even read my comment? The "application" is called pdftotext, and instead of putting the individual letters on a bitmap, it puts them in a string literal.

> just using the "quality implementation"?
What is the quality implementation?