I think this is a useful insight for people building RAG pipelines on top of LLMs.
Devs working on RAG have to choose between parsing the PDF's internal structure, running computer vision on rendered pages, or combining the two.
The author of the blog works on PdfPig, a library for parsing PDFs. Its document layout analysis APIs take a hybrid approach, combining basic image understanding algorithms with PDF metadata. https://github.com/UglyToad/PdfPig/wiki/Document-Layout-Anal...
GP's comment says a pure computer vision approach may be more effective in many real-world scenarios. That's an interesting insight, since many devs would assume pure computer vision to be both less capable and more complex than parsing.
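To make the vision route concrete, here's a rough sketch (PyMuPDF for rasterizing and layoutparser's PubLayNet detector are my illustrative choices, not necessarily GP's stack, and "report.pdf" is a hypothetical input). You never touch PDF internals: the page becomes an image, and the detector hands back labeled regions with coordinates:

    # Sketch: rasterize a page, then detect layout regions visually.
    import fitz  # PyMuPDF
    import numpy as np
    import layoutparser as lp

    page = fitz.open("report.pdf")[0]
    pix = page.get_pixmap(dpi=150)  # render the page to a raster image
    img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
        pix.height, pix.width, pix.n)

    # Off-the-shelf detector trained on PubLayNet document layouts.
    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"})

    for block in model.detect(img):
        print(block.type, block.coordinates)  # e.g. Table (x1, y1, x2, y2)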
As for the other comments that suggest using a parsing library's rendering APIs directly instead of rasterizing the end result: detecting high-level visual objects (tables, headings, illustrations) and getting their coordinates is far easier with vision models than trying to infer those structures from hundreds of low-level line, text, glyph, and path objects. I suspect those commenters have never tried to extract high-level structures from a PDF object model. Try it once with PDFBox, Fitz, etc. to appreciate the difficulty. PDF really is a terrible format!
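For contrast, here's what the object-model route looks like with PyMuPDF (again a sketch with a hypothetical "report.pdf"). Note that nothing in the output says "table": you get loose text spans and anonymous vector strokes, and reconstructing the grid is entirely on you:

    # Sketch: the low-level objects a PDF parser actually exposes.
    import fitz  # PyMuPDF

    page = fitz.open("report.pdf")[0]

    # Text arrives as blocks -> lines -> spans with raw coords and fonts.
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                print(span["bbox"], span["size"], repr(span["text"]))

    # Table borders are just unnamed line/rect drawing commands; grouping
    # them into cells and matching spans to those cells is your problem.
    for drawing in page.get_drawings():
        for item in drawing["items"]:
            print(item[0], item[1:])  # e.g. ('l', start_point, end_point)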