Hacker News

You are assuming structure where there is none. It's not the crack, it's the lack of experience with PDFs from diverse sources. Just for instance, I had a period where I was _regularly_ working with PDF files whose letters were in reverse order, each letter laid out individually (not a single complete word in the file).

throwaway4496 2 days ago

You're thinking "rendering structured data" means parsing PDF as text. That is just wrong. Carefully read what I said. You render the PDF, but into structured data rather than raster. If you still get letters in reverse when you render your PDF into structured data, your rendering engine is broken.

dotancohen 2 days ago

How do you render into structured data, from disparate letters that are not structured?

  D10
  E1
  H0
  L2,3,9
  O4,7
  R8
  W6

I'm sure that you could look at that and figure out how to structure it. But I highly doubt that you have a general-purpose computer program that can parse that into structured data, having never encountered such a format before. Yet, that is how many real-world PDF files are composed.
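For this one toy listing, and only after a human has guessed that each line is a letter followed by its zero-based positions, the reconstruction is short. A minimal Python sketch; the name `decode_letter_index` and the assumption that unlisted positions are spaces are illustrative choices, not part of any PDF format:

```python
def decode_letter_index(listing: str, filler: str = " ") -> str:
    """Rebuild a string from lines shaped like '<letter><pos>[,<pos>...]'."""
    placed = {}
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        letter, positions = line[0], line[1:]
        for pos in positions.split(","):
            placed[int(pos)] = letter
    if not placed:
        return ""
    # Assumption: a position that never appears (5 in the listing) is a space.
    return "".join(placed.get(i, filler) for i in range(max(placed) + 1))


listing = """
D10
E1
H0
L2,3,9
O4,7
R8
W6
"""
print(decode_letter_index(listing))  # HELLO WORLD
```

The decoder works only because the layout was guessed in advance; a file that encodes its letters differently needs a different decoder.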

throwaway4496 2 days ago

It is called rendering. MuPDF, Poppler, PDF.js, and so on. The problem is that you and everyone else think "rendering" means bitmaps. That is not how it works.

dotancohen 2 days ago

Then I would very much appreciate it if you would enlighten me. I'm serious, I would love nothing more than for you to prove your point, teach me something, and win an internet argument. Once rendered, do any of the rendering engines give you e.g. selectable or accessible text? Poppler didn't, and neither did some Java library that I tried.

For me, learning something new is very much worth losing the internet argument!

throwaway4496 21 hours ago

I have explained the details in other comments, have a look. But you can start by looking at pdftotext from Poppler: it is ready to go for 60-70% of cases with the -layout flag, and with -bbox-layout you get even more detail.
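A small sketch of driving pdftotext from Python with those two modes. The helper names `pdftotext_command` and `extract` and the file name in the usage comment are hypothetical; the only Poppler features assumed are the `-layout` and `-bbox-layout` flags and `-` as the output file meaning stdout:

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, mode: str = "layout") -> list[str]:
    """Build the argument list for one of the two modes mentioned above."""
    flags = {"layout": "-layout", "bbox-layout": "-bbox-layout"}
    # "-" as the output file asks pdftotext to write to stdout.
    return ["pdftotext", flags[mode], pdf_path, "-"]


def extract(pdf_path: str, mode: str = "layout") -> str:
    """Run pdftotext and return its output (plain text, or XHTML for bbox-layout)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) is not on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage (hypothetical file name):
#   text = extract("invoice.pdf")                  # layout-preserving plain text
#   xhtml = extract("invoice.pdf", "bbox-layout")  # per-word bounding boxes
```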

dotancohen 20 hours ago

Thank you. Even with bbox-layout, one cannot know that there is a coherent word or phrase to extract without visually inspecting the PDF beforehand. I've been there, fighting with it right in the CLI and finding that there is no way to even progress to a script.

The advantage of the OCR method is that it effectively performs that visual inspection. That's why it is preferable for PDFs of disparate origin.

throwaway4496 13 hours ago

What kind of semantics can you infer from the text of OCRing a bitmap that you can't infer from the text generated directly from the PDF? Is it the lack of OCR mistakes? The hallucinations? Or something else?

dotancohen 4 hours ago

In the cases that I've seen, the PDF software does not generate text strings. It generates individual letters. It is up to any application to try to figure out how those individual letters relate to one another.
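To make "figure out how those individual letters relate" concrete, here is a rough sketch of one common heuristic: cluster glyphs by baseline, sort each line left to right, and insert a space where the horizontal gap is large. The `Glyph` type, the tolerances, and the top-origin y coordinate are all assumptions for illustration; this is not the algorithm of Poppler or any other specific engine.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline, measured from the top of the page (assumption)
    width: float  # advance width


def glyphs_to_text(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Rebuild text from individually placed glyphs, ignoring file order."""
    lines = []  # each entry: (baseline of first glyph, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: g.y):
        if lines and abs(g.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "  # a wide gap is treated as a word break
            text += cur.char
        out.append(text)
    return "\n".join(out)


# Letters stored in reverse order, one glyph each, with no space glyph at all.
word = "HELLO WORLD"
glyphs = [Glyph(c, x=6.0 * i, y=100.0, width=6.0)
          for i, c in enumerate(word) if c != " "]
glyphs.reverse()
print(glyphs_to_text(glyphs))  # HELLO WORLD
```

The order of glyphs in the file does not matter here because only the positions are used; the fragile part is the choice of thresholds, which is where real documents with unusual spacing or rotated text go wrong.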

throwaway4496 an hour ago

Did you even read my comment? The "application" is called pdftotext, and instead of putting the individual letters on a bitmap, it puts them in a string literal.