The folks who are using RAG, what's the SOTA for extracting text from PDF documents? I have been following discussions on HN and I have seen a few promising solutions that involve converting PDF to PNG and then doing extraction. However, for my application this looks a bit risky because my PDFs have tons of tables and I can't afford to get back incorrect or made-up numbers.
The original documents are in HTML format and although I don't have direct access to them, I can obtain them if I want. Is it better to just use these HTML documents instead? Previously I tried converting HTML to Markdown and then using those for RAG. I wasn't too happy with the results, although I fear I might be doing something wrong.
Extracting structure and elements from HTML should be trivial, and there are probably multiple libraries for it in your programming language of choice. Be happy you have machine-readable semantic documents; that's the best-case scenario in NLP. I used to convert the chunks to Markdown as it was more token-efficient and LLMs are often heavily preference-trained on Markdown, but I'm not sure that still matters given current input pricing and LLM performance gains.
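If you go the HTML route, something like this is roughly what I mean; a minimal sketch using BeautifulSoup and markdownify, where the file name and the tags I strip are just placeholders for your own documents:

```python
from bs4 import BeautifulSoup          # pip install beautifulsoup4
from markdownify import markdownify    # pip install markdownify

with open("report.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Drop non-content elements before conversion.
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()

# Convert the cleaned HTML to Markdown; "atx" gives you # / ## headings,
# which keeps heading levels explicit for later chunking.
markdown = markdownify(str(soup), heading_style="atx")
print(markdown[:1000])
```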
If you have scanned documents, last I checked Gemini Flash was very good cost/performance-wise for document extraction. Mistral OCR claims better performance in their benchmarks, but people I know who used it, and other benchmarks, beg to differ. Personally I use Azure Document Intelligence a lot for the bounding-boxes feature, but Gemini Flash apparently has this covered too.
https://getomni.ai/blog/ocr-benchmark
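For reference, a Gemini Flash extraction call with the google-generativeai package looks roughly like this; the model name and prompt are placeholders, adjust to whatever is current:

```python
import google.generativeai as genai   # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload the scanned PDF and ask for a structured Markdown transcription.
doc = genai.upload_file("scanned_report.pdf")
response = model.generate_content(
    [doc, "Extract all text from this document as Markdown. "
          "Preserve tables as Markdown tables and keep heading levels."]
)
print(response.text)
```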
Sidenote: What you want for RAG is not OCR as in extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Good RAG is multimodal and aware of semantic document structure and layout, so your pipeline needs to extract and recognize text sections, footers/headers, images, and tables. When working with PDFs you also want accurate bounding boxes in your metadata so you can point users back to the retrieved sources, etc.
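To make the bounding-box point concrete, this is the kind of metadata I keep per chunk; the schema and field names are purely illustrative, not from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class DocChunk:
    """One retrievable unit plus the layout metadata needed to cite it."""
    text: str                      # extracted text (or Markdown) of the chunk
    doc_id: str                    # source document identifier
    page: int                      # 1-based page number in the PDF
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page
    element_type: str = "paragraph"          # "table", "header", "figure", ...
    section_path: list[str] = field(default_factory=list)  # e.g. ["Results", "Tables"]

# Example: a table chunk that can be highlighted on page 14 of the source PDF.
chunk = DocChunk(
    text="| Year | Revenue |\n|------|---------|\n| 2023 | 1.2M |",
    doc_id="annual-report.pdf",
    page=14,
    bbox=(72.0, 120.5, 540.0, 310.0),
    element_type="table",
    section_path=["Financials", "Revenue by year"],
)
```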
Yeah, thanks for pointing out the OCR distinction! We also found that for complex PDFs, you first need OCR to convert them into Markdown and then run PageIndex. However, most OCR tools process each page independently, which causes them to lose the overall document structure. For example, existing OCR tools often generate incorrect heading levels, which is a big problem if you want to build a tree structure from them. You could check out PageIndex-OCR, the first long-context OCR model, which produces Markdown with more accurate heading-level recognition.
I am always on the lookout for new document extraction tools, but can't seem to find any benchmarks for PageIndex-OCR. There are several like OmniDocBench and readoc. So... Got benchmark?
> Sidenote: What you want for RAG is not OCR as in extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Got it. Indeed, I need to do End-to-End Document Parsing/Extraction.
How about using something like Apache Tika for extracting text from multiple documents? It started as a Lucene subproject and consists of a proxy parser plus delegates for a number of document formats. If a document, e.g. a PDF, comes from a scanner, Tika can optionally shell out to Tesseract and perform OCR for you.
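The Python bindings (tika-python) make the round-trip pretty short; a minimal sketch, assuming the library can start or reach a Tika server and the file name is a placeholder:

```python
from tika import parser   # pip install tika (downloads and runs the Tika server jar)

# Returns a dict with the extracted "content" plus document "metadata".
parsed = parser.from_file("contract.pdf")

print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:500])
```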
Tika's documentation is abysmal. Maybe it is a great product, but we had to scrap it because of that.
In our benchmarks, https://github.com/datalab-to/marker is the best if you need to deploy it on your own hardware.
Thanks! I will check this out.
If accuracy is a major concern, then going with the HTML documents is almost certainly the better option. Otherwise, I've heard Docling is pretty good from a few co-workers.
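For what it's worth, the Docling quickstart is roughly this (from memory of their README, so double-check the current API):

```python
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("report.pdf")   # also accepts URLs and other formats

# Export the parsed, layout-aware document to Markdown for chunking.
print(result.document.export_to_markdown())
```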
So you suggest working directly with HTML or going HTML -> Markdown first?
Our PageIndex for HTML will be open-sourced next week; we are actually working on that!
extractous is worth a look if it's real text
If it's an image / you need to OCR it, Gemini Flash is so good and so cheap that I've had good luck using it as a "meta OCR" tool
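Back on extractous: the Python bindings are short. A sketch from memory of the project README; the exact method name and return type may differ between versions, so check the current docs:

```python
from extractous import Extractor   # pip install extractous

extractor = Extractor()
# Convenience call from the README; newer versions may also return metadata.
result = extractor.extract_file_to_string("report.pdf")
print(result)
```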
I will try it out. Is this the correct library? - https://github.com/yobix-ai/extractous
I have used Gemini for OCR and it was indeed good. I also used GPT 3.5 and liked that too.
You could also try PageIndex OCR, the first long-context OCR model. Most current OCR tools process each page independently, which causes them to lose the document's structure and produce Markdown with incorrect heading levels. PageIndex OCR generates Markdown with more accurate heading levels to better capture the document's structure.
Ok, thanks for sharing. I will take a look.
I've used nv-ingest and Nvidia's nemoretriever-parse model.
Can you explain why PNG? Why not Markdown?
Oh, I totally think Markdown is better than converting to PNG and then doing OCR. Maybe I did not use a good HTML-to-Markdown converter. The HTML documents are really long and the converter broke down a few times. But as I mentioned, this is probably on me, as I did not do a good job of finding a better HTML-to-Markdown converter.
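One thing that helped me with very long HTML: split on top-level headings first and convert each section separately, rather than feeding the whole file to the converter in one go. A rough sketch with BeautifulSoup and markdownify; the heading levels and file name are assumptions about your documents:

```python
from bs4 import BeautifulSoup
from markdownify import markdownify

with open("long_document.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Walk direct children of <body> (or the root, if there is no <body>).
sections, current = [], []
for el in (soup.body or soup).find_all(recursive=False):
    # Start a new section at every top-level <h1>/<h2>.
    if el.name in ("h1", "h2") and current:
        sections.append(current)
        current = []
    current.append(el)
if current:
    sections.append(current)

# Convert each section to Markdown independently so one bad section
# doesn't take down the whole conversion.
markdown_chunks = [markdownify("".join(str(e) for e in sec)) for sec in sections]
print(f"{len(markdown_chunks)} sections converted")
```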