While we have a PDF internals expert here, I'm itching to ask: Why is mupdf-gl so much faster than everything else? (on vanilla desktop Linux)
Its search speed on big PDFs is dramatically faster than anything else I've tried, and I've often wondered why the others can't be as fast as mupdf-gl.
Thanks for any insights!
It's funny you should ask - I've spent a fair amount of time building PDF indexing/search apps on the side over the past few weeks.
I'll give you the rundown. The answer to your specific question is basically "some of them process text letter by letter to put it back in order, and some don't; some build fast trie-style indexes to do the searching, and some don't."
All of my machine manuals etc. are in PDF, and too many search apps/OS search indexers don't make it simple to find things in them. I have a really good app on the Mac, but basically nothing on Windows. All I want is a dumb single-window app that can manage PDF collections, search them for words, and display the results. Nothing more, nothing less.
So I built one for my non-Mac platforms over the past few weeks: one version in C++ (using Qt) and one in .NET (using MAUI), for fun.
All told, I'm indexing (for this particular example) 2,500 PDFs containing about 150k pages.
On the indexing side, Lucene and SQLite FTS both do a fine job with no issues - both are fast, and indexing/search is never limited by their speed or capability.
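For a sense of how small the FTS side is, here's a sketch against the sqlite3 C API (assumes an FTS5-enabled build; the table/column names and the sample row are made up for illustration):

    // Sketch of the SQLite FTS5 side (sqlite3 C API, FTS5-enabled build).
    // Table/column names and the sample row are illustrative only.
    #include <sqlite3.h>
    #include <cstdio>

    int main()
    {
        sqlite3 *db;
        sqlite3_open("index.db", &db);

        // One row per page; FTS5 maintains the inverted index for us.
        sqlite3_exec(db,
            "CREATE VIRTUAL TABLE IF NOT EXISTS pages "
            "USING fts5(path UNINDEXED, page UNINDEXED, body);",
            nullptr, nullptr, nullptr);

        // Indexing: batch many inserts per transaction - this is the
        // "sequentially, in batches" part; insert speed lives and dies by it.
        sqlite3_exec(db, "BEGIN;", nullptr, nullptr, nullptr);
        sqlite3_stmt *ins;
        sqlite3_prepare_v2(db,
            "INSERT INTO pages(path, page, body) VALUES (?1, ?2, ?3);",
            -1, &ins, nullptr);
        sqlite3_bind_text(ins, 1, "manuals/lathe.pdf", -1, SQLITE_STATIC);
        sqlite3_bind_int(ins, 2, 12);
        sqlite3_bind_text(ins, 3, "spindle bearing preload ...", -1, SQLITE_STATIC);
        sqlite3_step(ins);
        sqlite3_finalize(ins);
        sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr);

        // Searching: MATCH walks the index, not the page text.
        sqlite3_stmt *q;
        sqlite3_prepare_v2(db,
            "SELECT path, page FROM pages WHERE pages MATCH ?1;", -1, &q, nullptr);
        sqlite3_bind_text(q, 1, "\"bearing preload\"", -1, SQLITE_STATIC);
        while (sqlite3_step(q) == SQLITE_ROW)
            std::printf("%s p.%d\n",
                        (const char *)sqlite3_column_text(q, 0),
                        sqlite3_column_int(q, 1));
        sqlite3_finalize(q);
        sqlite3_close(db);
        return 0;
    }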
On the PDF parsing/text extraction side, I have tried literally every library I could find for my ecosystems (about 25), both commercial and not. I skipped libraries that I know share underlying text extraction engines (there are a million PDFium wrappers, for example).
I parse in parallel (i.e., files are processed in parallel), extract pages in parallel (i.e., every page is processed in parallel), and index the extracted text either in parallel or in batches (Lucene is happy with multiple threads indexing; SQLite would rather have me do it sequentially in batches).
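The shape of the pipeline, as a sketch - extract_page_text() is a hypothetical stand-in for whatever library is doing the extraction, and a real app would cap the fan-out with a thread pool rather than raw std::async:

    // Pipeline sketch: files fan out across threads, each file's pages fan
    // out again, and the results feed the indexer.
    #include <future>
    #include <string>
    #include <vector>

    // Stub standing in for the real PDF library call (hypothetical).
    std::string extract_page_text(const std::string &path, int page)
    {
        return "text of " + path + " page " + std::to_string(page);
    }

    std::vector<std::string> extract_file(const std::string &path, int page_count)
    {
        // One task per page.
        std::vector<std::future<std::string>> tasks;
        for (int p = 0; p < page_count; p++)
            tasks.push_back(std::async(std::launch::async, extract_page_text, path, p));

        std::vector<std::string> pages;
        for (auto &t : tasks)
            pages.push_back(t.get());
        return pages; // hand off to Lucene (thread-happy) or SQLite (batched)
    }

    int main()
    {
        for (const std::string &s : extract_file("manuals/lathe.pdf", 4))
            (void)s; // index here
    }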
The slowest libraries are 100x slower than the fastest at extracting text. They cluster, too, so I assume some of them share underlying strategies or code despite my attempt to identify these ahead of time. The current Foxit SDK can extract about 1,000-2,000 pages per second, sometimes faster, while things like PdfPig can only do about 10 pages per second.
PDFium would be as fast as the current Foxit SDK, but it is not thread safe (I assume because it's based on a source drop of Foxit from before they added thread safety), so all calls have to be serialized. Even so, it can do about 100-200 pages/second.
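"Serialized" just means every PDFium call goes through one global lock, something like this sketch (the FPDF_*/FPDFText_* functions are PDFium's real C API; the locking policy is just my approach):

    // PDFium isn't thread safe, so: one big lock around every call.
    #include <fpdfview.h>
    #include <fpdf_text.h>
    #include <mutex>
    #include <vector>

    std::mutex g_pdfium_lock;

    std::vector<unsigned short> extract_page_utf16(FPDF_DOCUMENT doc, int index)
    {
        std::lock_guard<std::mutex> guard(g_pdfium_lock);
        FPDF_PAGE page = FPDF_LoadPage(doc, index);
        FPDF_TEXTPAGE text = FPDFText_LoadPage(page);
        int n = FPDFText_CountChars(text);
        std::vector<unsigned short> out(n + 1);
        FPDFText_GetText(text, 0, n, out.data()); // UTF-16LE plus NUL
        FPDFText_ClosePage(text);
        FPDF_ClosePage(page);
        return out;
    }

    int main()
    {
        FPDF_InitLibrary();
        FPDF_DOCUMENT doc = FPDF_LoadDocument("manual.pdf", nullptr);
        if (doc)
        {
            extract_page_utf16(doc, 0);
            FPDF_CloseDocument(doc);
        }
        FPDF_DestroyLibrary();
        return 0;
    }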
Memory usage also varies wildly and is uncorrelated with speed (i.e., there are fast ones that take tons of memory and slow ones that take tons of memory). For the native ones, memory usage seems to be more about fragmentation than about dumb things. There are, of course, some dumb things (one library creates a new C++ class instance for every letter).
From what I can tell digging into the code that's available, it's all about how much work they do up front when loading the file, and then how much time they take to put the text back into content order before handing it to me.
The slowest do it letter by letter. The fastest don't.
Rendering is similar - some of them are dominated by stupid shit you notice instantly with a profiler. For example, one of the .NET libraries renders to PNG-encoded bitmaps by default, and between it and Windows, it spends 300ms encoding/decoding each page for display - 10x longer than the rasterization itself. If I switch it to render to BMP instead, the encode/decode takes 5ms (for dumb reasons, the MAUI APIs require streams to create drawable images). The difference is very noticeable when I browse through search results with the up/down keys.
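You can see the encode-cost gap yourself with a few lines of Qt (illustrative only; exact numbers depend on the image and the build, but PNG pays for a compression pass while BMP is close to a memcpy):

    #include <QBuffer>
    #include <QElapsedTimer>
    #include <QImage>
    #include <cstdio>

    int main()
    {
        QImage page(1700, 2200, QImage::Format_RGB32); // ~letter page at 200 dpi
        page.fill(Qt::white);

        for (const char *fmt : { "PNG", "BMP" })
        {
            QBuffer buf;
            buf.open(QIODevice::WriteOnly);
            QElapsedTimer t;
            t.start();
            page.save(&buf, fmt); // encode into an in-memory stream
            std::printf("%s: %lld ms, %lld bytes\n", fmt,
                        (long long)t.elapsed(), (long long)buf.size());
        }
        return 0;
    }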
Anyway, hopefully this helps answer your question and some related ones.
> From what I can tell digging into the code that's available, ..., how much time they take to put the text back into content order... The slowest do it letter by letter. The fastest don't.
Thank you, that's really helpful.
I hadn't considered content reordering, but it makes perfect sense given that the internal character ordering can be anything as long as the page renders correctly. There's an interesting comp-sci homework project here: given a document represented by an unordered list of tuples [ (pageNum, x, y, char) ], quickly determine whether the doc contains a given search string.
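A naive baseline, as a sketch: sort the glyphs into reading order, flatten to a string, and substring-search. Real pages need tolerances for baseline jitter, columns, and so on, and "quickly" really wants the sort/flatten done once with many searches run against it:

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <tuple>
    #include <vector>

    struct Glyph { int page; double x, y; char c; };

    bool contains(std::vector<Glyph> glyphs, const std::string &needle)
    {
        // Reading order: page, then row, then column. Assumes y grows
        // downward; PDF's bottom-left origin needs the y comparison flipped.
        std::sort(glyphs.begin(), glyphs.end(),
                  [](const Glyph &a, const Glyph &b) {
                      return std::tie(a.page, a.y, a.x) < std::tie(b.page, b.y, b.x);
                  });
        std::string text;
        text.reserve(glyphs.size());
        for (const Glyph &g : glyphs)
            text.push_back(g.c);
        return text.find(needle) != std::string::npos;
    }

    int main()
    {
        std::vector<Glyph> g = { {1, 30, 10, 'i'}, {1, 20, 10, 'h'} };
        std::printf("%d\n", contains(g, "hi")); // prints 1
    }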
Sometimes I need to search PDFs for a regex, and for that I use pdfgrep. It builds on poppler/xpdf, which extracts text >2x slower than mupdf (https://documentation.help/pymupdf/app1.html#part-2-text-ext..., fitz vs. xpdf). After this discussion, I'm now writing my own pdfgrep on top of mupdf.
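The skeleton is small. A sketch of the idea, using MuPDF's C API plus std::regex (fz_try/fz_catch error handling omitted for brevity; MuPDF's structured-text API already hands back text in content order):

    #include <mupdf/fitz.h>
    #include <cstdio>
    #include <regex>

    int main(int argc, char **argv)
    {
        if (argc != 3)
        {
            std::fprintf(stderr, "usage: %s <regex> <file.pdf>\n", argv[0]);
            return 1;
        }
        std::regex re(argv[1]);

        fz_context *ctx = fz_new_context(nullptr, nullptr, FZ_STORE_UNLIMITED);
        fz_register_document_handlers(ctx);
        fz_document *doc = fz_open_document(ctx, argv[2]);

        int pages = fz_count_pages(ctx, doc);
        for (int i = 0; i < pages; i++)
        {
            // Extract the page's text in content order, then regex it.
            fz_stext_page *text = fz_new_stext_page_from_page_number(ctx, doc, i, nullptr);
            fz_buffer *buf = fz_new_buffer_from_stext_page(ctx, text);
            if (std::regex_search(fz_string_from_buffer(ctx, buf), re))
                std::printf("%s: match on page %d\n", argv[2], i + 1);
            fz_drop_buffer(ctx, buf);
            fz_drop_stext_page(ctx, text);
        }

        fz_drop_document(ctx, doc);
        fz_drop_context(ctx);
        return 0;
    }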