Hacker News

In my experience it really depends on what sort of pdfs you are trying to extract text from (ie what the content is).

Regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text, and for those I’ve had a lot of success using pypdf.

“Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR. At the moment my personal RAG pipeline does this with a local Gemma4 model (you could use something else).
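A minimal sketch of that split, assuming pypdf is installed: pull the embedded text of each page with `PdfReader` and `extract_text()`, then treat pages with almost no text as scan-only candidates for OCR. The helper names and the 20-character threshold are made up for illustration, and a genuinely blank page will be flagged too.

```python
from typing import List


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page (empty string if there is none)."""
    # Imported lazily so the pure helper below works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Indices of pages whose embedded text is too short to be a real transcript.

    min_chars is an arbitrary illustrative threshold; tune it for your documents.
    """
    return [
        i
        for i, text in enumerate(page_texts)
        # Ignore whitespace so a page of stray newlines still counts as empty.
        if len("".join(text.split())) < min_chars
    ]
```

Usage would be something like `pages_needing_ocr(extract_page_texts("paper.pdf"))`, where `paper.pdf` is a placeholder path; the returned indices are the pages to send to whatever OCR you use.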

Either way I do an audit post-ingest where I select a random set of pages, get the local gemma model to extract that same set, and compare the results. The symptoms to look out for will depend a lot on what you’re trying to extract, but I’m mostly extracting maths, so I get the model to check the extraction of symbols, equations etc. One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding), as this almost always catches pdfs that have just extracted as pure garbage. I added this step because I was ingesting a lot of old maths pdfs with specialist notation that wasn’t always being ingested correctly, and because they were image pdfs the text was coming in as pure garbage. The fix for those is to use a specialist OCR service (I have been using “mathpix”, which has been great and isn’t too expensive if you don’t have too much to process).
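A rough sketch of what such an audit could look like, using only the standard library. Everything here is an assumption for illustration: the function names, the thresholds, and the `reextract` callable, which stands in for whatever second opinion you have (a local model, an OCR service). The garbage check only counts replacement characters, control characters, private-use and unassigned codepoints, and `(cid:N)` markers (which pdfminer.six, for one, emits for unmapped glyphs), so it will miss garbage that happens to be made of ordinary letters.

```python
import random
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Callable, List

# Marker some extractors emit when a glyph has no Unicode mapping.
_CID_MARKER = re.compile(r"\(cid:\d+\)")


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    # Count each "(cid:N)" marker as a single bad character.
    cleaned = _CID_MARKER.sub("\ufffd", text)
    chars = [c for c in cleaned if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        # Cc = control, Co = private use, Cn = unassigned.
        if c == "\ufffd" or unicodedata.category(c) in {"Cc", "Co", "Cn"}
    )
    return bad / len(chars)


def sample_pages(n_pages: int, k: int, seed: int = 0) -> List[int]:
    """Pick up to k distinct page indices, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_pages), min(k, n_pages)))


def similarity(a: str, b: str) -> float:
    """Rough 0..1 agreement between two extractions, ignoring whitespace layout."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def audit(
    primary: List[str],
    reextract: Callable[[int], str],
    k: int = 5,
    max_suspicious: float = 0.05,
    min_similarity: float = 0.8,
) -> List[int]:
    """Return the sampled page indices that fail either check."""
    flagged = []
    for i in sample_pages(len(primary), k):
        looks_garbled = suspicious_ratio(primary[i]) > max_suspicious
        disagrees = similarity(primary[i], reextract(i)) < min_similarity
        if looks_garbled or disagrees:
            flagged.append(i)
    return flagged
```

For maths-heavy documents the word-level `similarity` is probably too blunt; comparing just the symbols or equations, as described above, would need a domain-specific check in its place.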

The other thing that can cause problems is tables (eg if you were trying to ingest a lot of pdfs like company financials). Those can trip up both the OCR and the pure text extraction methods. I don’t have a current recommendation because I haven’t done it recently enough and the state of the art has moved a lot, but be aware that tables will require special treatment.

> regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text

Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled, in my experience. Without them, PDF viewers have to perform all kinds of ugly hacks to even figure out what Unicode codepoint a given glyph is supposed to represent! PDFs are much more of a vector graphics format than a text layout format, more so than most people seem to realize.

> One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)

This is exactly the problem with PDFs: It's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints when the PDF contains no explicit mapping table. Including that table is something the creator actively has to do.
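In PDF terms that mapping table is a font's `/ToUnicode` CMap. A rough sketch, assuming pypdf, of how one might list the fonts on each page that lack one. The function names are made up, and it only looks at fonts referenced directly in a page's own `/Resources`, so inherited resources and fonts inside form XObjects are not covered. A missing `/ToUnicode` is a warning sign rather than proof of trouble, since fonts with a standard encoding can still extract fine.

```python
from typing import Dict, List, Mapping


def fonts_without_tounicode(fonts: Mapping[str, Mapping]) -> List[str]:
    """Names of fonts whose dictionary has no /ToUnicode entry."""
    return sorted(name for name, font in fonts.items() if "/ToUnicode" not in font)


def page_fonts(page: Mapping) -> Dict[str, Mapping]:
    """Collect the font dictionaries referenced directly by a page.

    Works on a pypdf PageObject (a dict subclass) or any nested mapping.
    """
    if "/Resources" not in page:
        return {}
    resources = page["/Resources"]
    if "/Font" not in resources:
        return {}
    fonts = resources["/Font"]
    # Indexing a pypdf dictionary resolves indirect references for us.
    return {str(name): fonts[name] for name in fonts}


def report(path: str) -> Dict[int, List[str]]:
    """Map page index -> fonts on that page with no /ToUnicode table."""
    # Imported lazily so the helpers above work without pypdf installed.
    from pypdf import PdfReader

    result: Dict[int, List[str]] = {}
    for index, page in enumerate(PdfReader(path).pages):
        missing = fonts_without_tounicode(page_fonts(page))
        if missing:
            result[index] = missing
    return result
```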

> “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.

For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.

lxgr 10 hours ago

seanhunter 6 hours ago

Hmm. Not sure what I'm doing that's special, but both the latex pdfs I produce and others that I read generally work just fine with pypdf, and I'm really not adding any flags at all. My makefile just runs

   latexmk --lualatex -aux-directory=output -output-directory=output $<

Maybe latexmk is adding some magic?

lxgr 6 hours ago

\usepackage{cmap} is usually what does that:

> The cmap package provides character map tables, which make PDF files generated by pdfLATEX both searchable and copy-able in acrobat reader and other compliant PDF viewers.

(from https://ctan.org/pkg/cmap)

seanhunter 5 hours ago

I don't use cmap or pdflatex. Weird.