> regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text
Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled, in my experience. Without them, PDF viewers have to perform all kinds of ugly hacks to even figure out what Unicode codepoint a given glyph is supposed to represent! PDF is much more of a vector graphics format than a document layout format, which most people don't seem to realize.
> One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)
This is exactly the problem with PDFs: it's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints when the PDF contains no explicit mapping table (a ToUnicode CMap). Including that table is something the creator actively has to do.
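To make the "explicit mapping table" concrete, here is a minimal Python sketch of what a ToUnicode CMap gives a text extractor. The `SAMPLE_CMAP` string and the helper names `parse_bfchar` and `decode_codes` are made up for illustration. It handles only `bfchar` entries with single-byte glyph codes (no `bfrange`, no multi-byte codes), so treat it as a toy, not a real CMap parser.

```python
import re

# One "<glyph code> <UTF-16BE text>" pair inside a bfchar section.
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")
# Everything between "beginbfchar" and "endbfchar".
_BFCHAR = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Return {glyph code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in _BFCHAR.findall(cmap_text):
        for src, dst in _PAIR.findall(section):
            # The destination is hex-encoded UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes: bytes, mapping: dict[int, str]) -> str:
    """Map single-byte glyph codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# Hypothetical CMap fragment: glyph 1 is "H", glyph 2 is "i", glyph 3 is the "fi" ligature.
SAMPLE_CMAP = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <FB01>
endbfchar
"""

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_codes(b"\x01\x02", mapping))  # Hi
print(decode_codes(b"\x03\x04", mapping))  # the fi ligature, then U+FFFD for unmapped code 4
```

Without the table, all an extractor sees is the raw codes `\x01\x02`, and it has to guess the text from glyph names or shapes. That guessing is where the garbage comes from.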
> “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.
For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.
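As a rough sketch of how one might decide when to give up on the embedded text and fall back to OCR, the Python below scores extracted text by the share of characters that look like failed glyph mapping. The function names and the 0.1 threshold are my own assumptions, not something any library prescribes. It will also miss the case where glyphs map to valid but wrong letters.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like failed glyph mapping."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # no text at all: treat the extraction as unusable
    bad = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cn = unassigned, Cc = control; U+FFFD = replacement character
        if c == "\ufffd" or category in ("Co", "Cn", "Cc"):
            bad += 1
    return bad / len(chars)


def should_ocr(text: str, threshold: float = 0.1) -> bool:
    """Assumed rule of thumb: fall back to OCR above an arbitrary threshold."""
    return suspicious_ratio(text) > threshold


print(should_ocr("Perfectly ordinary extracted text."))  # False
print(should_ocr("\uf041\uf042\uf043 ok"))               # True
```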
Hmm. Not sure what I'm doing that's special but both latex pdfs I produce and others that I read generally work just fine with pypdf, and I really am not adding any flags at all (my makefile says I just go …). Maybe latexmk is adding some magic?
\usepackage{cmap} is usually what does that:
> The cmap package provides character map tables, which make PDF files generated by pdfLaTeX both searchable and copy-able in acrobat reader and other compliant PDF viewers.
(from https://ctan.org/pkg/cmap)
I don't use cmap or pdflatex. Weird.