Hacker News

The PDF reader for Gemini is extraordinarily poor in my experience. I like the writing style of this model a little better, but for most tasks people would use AI for, Gemini is probably not what you want to be using.

trees101 16 hours ago

What is a good way to read PDFs using AI?

seanhunter 15 hours ago

In my experience it really depends on what sort of PDFs you are trying to extract from (i.e. what the content is).

Regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text, and for those I’ve had a lot of success using pypdf.
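A minimal sketch of that text-layer path, assuming pypdf is installed. `extract_pages` uses pypdf's `PdfReader` and `extract_text()`; the `pages_needing_ocr` helper and its `min_chars` cutoff are hypothetical, just one rough way to flag pages that came back empty and probably need OCR instead.

```python
def extract_pages(path):
    """Return the extracted text of every page, in page order."""
    # Imported here so the helper below also works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Image-only pages typically yield an empty string here.
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose text layer looks empty.

    min_chars is an arbitrary cutoff (an assumption; tune it per corpus).
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Usage would look like `pages_needing_ocr(extract_pages("paper.pdf"))`, where `paper.pdf` is a placeholder path. An empty result only says the pages contain some text, not that the text is correct.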

“Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR. At the moment my personal RAG pipeline does this using a local Gemma4 model (you could use something else).

Either way I do an audit post-ingest where I select a random set of pages, get the local Gemma model to extract that same set, and compare the two results. The symptoms to look out for here will depend a lot on what you’re trying to extract, but I’m extracting maths mostly, so I get the model to check extraction of symbols, equations etc. One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding) as this almost always catches pdfs that have just extracted as pure garbage.

I added this step because I was ingesting a lot of old maths pdfs which have specialist notation that wasn’t always getting correctly ingested, and as they were image pdfs it was coming in as pure garbage. The fix there is to use a specialist OCR service (I have been using “mathpix”, which has been great and isn’t too expensive if you don’t want to do too much).
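The audit described above could be sketched roughly like this, using only the standard library. Every name here (`sample_pages`, `garbage_ratio`, `agreement`, `audit`), the marker list, and both thresholds are assumptions for illustration, not the pipeline the comment describes. The second extraction is simply passed in as text, however it was produced.

```python
import difflib
import random
import unicodedata

# Sequences typical of UTF-8 text mis-decoded as Latin-1/cp1252.
# A heuristic only: some of these occur in legitimate text too.
MOJIBAKE_MARKERS = ("Ã", "â€", "Â")


def sample_pages(page_count, k, seed=None):
    """Pick up to k distinct 1-based page numbers at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, page_count + 1), min(k, page_count)))


def garbage_ratio(text):
    """Rough fraction of characters that look like extraction garbage."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        category = unicodedata.category(ch)
        # Replacement chars, private-use and unassigned codepoints.
        if ch == "\ufffd" or category in ("Co", "Cn"):
            bad += 1
        # Control characters other than ordinary whitespace.
        elif category == "Cc" and ch not in "\n\r\t":
            bad += 1
    bad += sum(text.count(marker) for marker in MOJIBAKE_MARKERS)
    return min(1.0, bad / len(text))


def _normalise(text):
    """Collapse whitespace so layout differences do not count as errors."""
    return " ".join(text.split())


def agreement(text_a, text_b):
    """Similarity in [0, 1] between two extractions of the same page."""
    matcher = difflib.SequenceMatcher(None, _normalise(text_a), _normalise(text_b))
    return matcher.ratio()


def audit(primary, secondary, pages, max_garbage=0.05, min_agreement=0.8):
    """Return the sampled pages that fail either check.

    primary and secondary map page number -> extracted text.
    """
    flagged = []
    for number in pages:
        a, b = primary.get(number, ""), secondary.get(number, "")
        if garbage_ratio(a) > max_garbage or agreement(a, b) < min_agreement:
            flagged.append(number)
    return flagged
```

Flagged pages are candidates for a rerun through a specialist OCR service. A character-level comparison like this will not judge whether an equation was extracted correctly, so for maths it is at best a first filter.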

The other thing that can cause problems is tables (e.g. if you were trying to ingest a lot of PDFs like company financials). Those can cause problems for both the OCR and the pure text extraction methods. I don’t have a current recommendation for that because I haven’t done it recently enough and the state of the art has moved a lot. It’s something to be aware of that will require special treatment though.

lxgr 9 hours ago

> regular pdfs that have been produced in a “normal way” (ie using latex or a modern application with a “save to pdf” function) will contain the text

Producing "normal PDFs" that way actually requires specific LaTeX options to be enabled in my experience. Without that, PDF viewers have to perform all kinds of ugly hacks to even figure out what Unicode codepoint a given glyph is supposed to represent! PDFs are much more of a vector graphics format than a text layout format, more so than most people seem to realize.

> One thing I have consistently found useful is to look for “mojibake” (scrambled text caused by decoding in an unintended character encoding)

This is exactly the problem with PDFs: It's not regular mojibake (i.e. interpreting a string of text in the wrong charset), but rather some PDF processor's failed attempt at mapping glyphs back to codepoints without an explicit mapping table being present in the PDF. Including that table is something the creator actively has to do.
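In PDF terms that mapping table is a font's `/ToUnicode` entry. A minimal sketch of checking for it, assuming pypdf is installed; the function names are hypothetical. A missing `/ToUnicode` is only a hint, since a font with a standard encoding can still extract correctly without one.

```python
def fonts_without_tounicode(fonts):
    """Return names of fonts whose dictionary has no /ToUnicode entry.

    fonts maps a font resource name (e.g. "/F1") to its font dictionary.
    """
    return sorted(name for name, font in fonts.items() if "/ToUnicode" not in font)


def report_missing_tounicode(path):
    """Map 1-based page number -> fonts lacking /ToUnicode."""
    # Imported here so the helper above also works without pypdf installed.
    from pypdf import PdfReader

    report = {}
    for number, page in enumerate(PdfReader(path).pages, start=1):
        resources = page["/Resources"] if "/Resources" in page else {}
        fonts = resources["/Font"] if "/Font" in resources else {}
        # Indexing a pypdf dictionary resolves indirect references.
        resolved = {str(name): fonts[name] for name in fonts}
        missing = fonts_without_tounicode(resolved)
        if missing:
            report[number] = missing
    return report
```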

> “Image” Pdfs that have been produced via a scan so don’t actually contain a text transcript require actual OCR.

For the reason above and others, in my experience, OCR actually works significantly better than trying to "semantically parse" the PDF.

seanhunter 6 hours ago

Hmm. Not sure what I'm doing that's special but both latex pdfs I produce and others that I read generally work just fine with pypdf, and I really am not adding any flags at all (my makefile says I just go

   latexmk --lualatex -aux-directory=output -output-directory=output $<

). Maybe latexmk is adding some magic?

lxgr 5 hours ago

\usepackage{cmap} is usually what does that:

> The cmap package provides character map tables, which make PDF files generated by pdfLATEX both searchable and copy-able in acrobat reader and other compliant PDF viewers.

(from https://ctan.org/pkg/cmap)

seanhunter 5 hours ago

I don't use cmap or pdflatex. Weird.

lostsock 16 hours ago

I have a standing instruction for any documents that can't natively be read by a given AI to first be converted into .md using https://github.com/microsoft/markitdown, which I've found to work really well.
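A minimal sketch of that convert-first rule, assuming the `markitdown` Python package is installed. `MarkItDown().convert(...)` and `text_content` come from the project's README; `NATIVE_SUFFIXES`, `needs_conversion`, and `to_markdown` are hypothetical names, and which formats count as natively readable is an assumption.

```python
from pathlib import Path

# Hypothetical set of formats the model is assumed to read natively.
NATIVE_SUFFIXES = {".md", ".txt"}


def needs_conversion(path, native_suffixes=NATIVE_SUFFIXES):
    """True if the file should go through markitdown first."""
    return Path(path).suffix.lower() not in native_suffixes


def to_markdown(path):
    """Return text for the model, converting with markitdown if needed."""
    if not needs_conversion(path):
        return Path(path).read_text(encoding="utf-8")
    # Imported here so the check above also works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(str(path))
    return result.text_content
```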

wwn_se 13 hours ago

Preprocessing with a PDF extraction and OCR tool and then feeding the result to the big model is usually way more stable.

chrsw 9 hours ago

In the broadest sense, I don't think we're there yet. I asked an SoC vendor to provide their chip documentation in Markdown. They refused. So, I went ahead and tried to do it myself with AI.

I tried various AI tools and the results ranged from absolute garbage to something-but-not-quite.

I went ahead and did a section of a huge PDF by hand, just to see if what I was asking for was even feasible. After many hours of painstaking work spread across multiple days, I got several chapters to look identical to the source PDF in some Markdown renderers. I had to use some HTML for the more complex tables. I converted some diagrams to Markdown and some to images linked to from the Markdown.

rawoke083600 11 hours ago

MinerU works well to get it into Markdown.