The folks who are using RAG, what's the SOTA for extracting text from PDF documents? I have been following discussions on HN and I have seen a few promising solutions that involve converting PDF to PNG and then doing extraction. However, for my application this looks a bit risky because my PDFs have tons of tables and I can't afford to get back incorrect or made-up numbers.
The original documents are in HTML format and although I don't have direct access to them, I can obtain them if I want. Is it better to just use these HTML documents instead? Previously I tried converting HTML to Markdown and then using those for RAG. I wasn't too happy with the results, although I fear I might be doing something wrong.
Extracting structure and elements from HTML should be trivial, and there are probably multiple libraries for it in your programming language of choice. Be happy you have machine-readable semantic documents; that's the best-case scenario in NLP. I used to convert the chunks to Markdown as it was more token-efficient and LLMs are often heavily preference-trained on Markdown, but I'm not sure that still matters given current input pricing and LLM performance gains.
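If you go the HTML route, something like this is roughly what I mean; a minimal sketch using BeautifulSoup and markdownify, where the file name and the tags I strip are just placeholders for your own documents:

```python
from bs4 import BeautifulSoup          # pip install beautifulsoup4
from markdownify import markdownify    # pip install markdownify

with open("report.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Drop non-content elements before conversion.
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()

# Convert the cleaned HTML to Markdown; "atx" gives you # / ## headings,
# which keeps heading levels explicit for later chunking.
markdown = markdownify(str(soup), heading_style="atx")
print(markdown[:1000])
```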
If you have scanned documents, last I checked Gemini Flash was very good cost/performance-wise for document extraction. Mistral OCR claims better performance in their benchmarks, but people I know who used it, and other benchmarks, beg to differ. Personally I use Azure Document Intelligence a lot for the bounding-boxes feature, but Gemini Flash apparently has this covered too.
https://getomni.ai/blog/ocr-benchmark
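For reference, a Gemini Flash extraction call with the google-generativeai package looks roughly like this; the model name and prompt are placeholders, adjust to whatever is current:

```python
import google.generativeai as genai   # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload the scanned PDF and ask for a structured Markdown transcription.
doc = genai.upload_file("scanned_report.pdf")
response = model.generate_content(
    [doc, "Extract all text from this document as Markdown. "
          "Preserve tables as Markdown tables and keep heading levels."]
)
print(response.text)
```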
Sidenote: What you want for RAG is not OCR as in extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Good RAG is multimodal and aware of semantic document structure and layout, so your pipeline needs to extract and recognize text sections, footers/headers, images, and tables. When working with PDFs you also want accurate bounding boxes in your metadata so you can point users back to the retrieved sources, etc.
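To make the bounding-box point concrete, this is the kind of metadata I keep per chunk; the schema and field names are purely illustrative, not from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class DocChunk:
    """One retrievable unit plus the layout metadata needed to cite it."""
    text: str                      # extracted text (or Markdown) of the chunk
    doc_id: str                    # source document identifier
    page: int                      # 1-based page number in the PDF
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) on that page
    element_type: str = "paragraph"          # "table", "header", "figure", ...
    section_path: list[str] = field(default_factory=list)  # e.g. ["Results", "Tables"]

# Example: a table chunk that can be highlighted on page 14 of the source PDF.
chunk = DocChunk(
    text="| Year | Revenue |\n|------|---------|\n| 2023 | 1.2M |",
    doc_id="annual-report.pdf",
    page=14,
    bbox=(72.0, 120.5, 540.0, 310.0),
    element_type="table",
    section_path=["Financials", "Revenue by year"],
)
```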
Yeah, thanks for pointing out the OCR distinction! We also found that for complex PDFs, you first need OCR to convert them into Markdown and then run PageIndex. However, most OCR tools process each page independently, which causes them to lose the overall document structure. For example, existing OCR tools often generate incorrect heading levels, which is a big problem if you want to build a tree structure from them. You could check out PageIndex-OCR, the first long-context OCR model, which produces Markdown with more accurate heading-level recognition.
I am always on the lookout for new document extraction tools, but can't seem to find any benchmarks for PageIndex-OCR. There are several like OmniDocBench and readoc. So... Got benchmark?
> Sidenote: What you want for RAG is not OCR as in extracting text. The task for RAG preprocessing is typically called Document Layout Analysis or End-to-End Document Parsing/Extraction.
Got it. Indeed, I need to do End-to-End Document Parsing/Extraction.
How about using something like Apache Tika for extracting text from multiple documents? It started as a Lucene subproject and consists of a proxy parser plus delegates for a number of document formats. If a document, e.g. a PDF, comes from a scanner, Tika can optionally shell out to Tesseract and perform OCR for you.
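The Python bindings (tika-python) make the round-trip pretty short; a minimal sketch, assuming the library can start or reach a Tika server and the file name is a placeholder:

```python
from tika import parser   # pip install tika (downloads and runs the Tika server jar)

# Returns a dict with the extracted "content" plus document "metadata".
parsed = parser.from_file("contract.pdf")

print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:500])
```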
Tika's documentation is abysmal. Maybe it is a great product, but we had to scrap it because of that.
In our benchmarks, https://github.com/datalab-to/marker is the best if you need to deploy it on your own hardware.
Thanks! I will check this out.
If accuracy is a major concern, then going with the HTML documents is almost certainly the better option. Otherwise, I've heard Docling is pretty good from a few co-workers.
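For what it's worth, the Docling quickstart is roughly this (from memory of their README, so double-check the current API):

```python
from docling.document_converter import DocumentConverter  # pip install docling

converter = DocumentConverter()
result = converter.convert("report.pdf")   # also accepts URLs and other formats

# Export the parsed, layout-aware document to Markdown for chunking.
print(result.document.export_to_markdown())
```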
So you suggest working directly with HTML or going HTML -> Markdown first?
Our PageIndex for HTML will be open-sourced next week; we are actually working on that!
extractous is worth a look if it's real text
If it's an image / you need to OCR it, Gemini Flash is so good and so cheap that I've had good luck using it as a "meta OCR" tool
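Back on extractous: the Python bindings are short. A sketch from memory of the project README; the exact method name and return type may differ between versions, so check the current docs:

```python
from extractous import Extractor   # pip install extractous

extractor = Extractor()
# Convenience call from the README; newer versions may also return metadata.
result = extractor.extract_file_to_string("report.pdf")
print(result)
```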
I will try it out. Is this the correct library? - https://github.com/yobix-ai/extractous
I have used Gemini for OCR and it was indeed good. I also used GPT 3.5 and liked that too.
You could also try PageIndex OCR, the first long-context OCR model. Most current OCR tools process each page independently, which causes them to lose the document's structure and produce Markdown with incorrect heading levels. PageIndex OCR generates Markdown with more accurate heading levels to better capture the document's structure.
Ok, thanks for sharing. I will take a look.
I've used nv-ingest and Nvidia's nemoretriever-parse model.
Can you explain why PNG? Why not Markdown?
Oh, I totally think Markdown is better than converting to PNG and then doing OCR. Maybe I did not use a good HTML-to-Markdown converter. The HTML documents are really long and the converter broke down a few times. But as I mentioned, this is probably on me, as I did not do a good job of finding a better HTML-to-Markdown converter.
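One thing that helped me with very long HTML: split on top-level headings first and convert each section separately, rather than feeding the whole file to the converter in one go. A rough sketch with BeautifulSoup and markdownify; the heading levels and file name are assumptions about your documents:

```python
from bs4 import BeautifulSoup
from markdownify import markdownify

with open("long_document.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Walk direct children of <body> (or the root, if there is no <body>).
sections, current = [], []
for el in (soup.body or soup).find_all(recursive=False):
    # Start a new section at every top-level <h1>/<h2>.
    if el.name in ("h1", "h2") and current:
        sections.append(current)
        current = []
    current.append(el)
if current:
    sections.append(current)

# Convert each section to Markdown independently so one bad section
# doesn't take down the whole conversion.
markdown_chunks = [markdownify("".join(str(e) for e in sec)) for sec in sections]
print(f"{len(markdown_chunks)} sections converted")
```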