Hacker News

I just spent a few weeks testing about 25 different PDF engines to parse files and extract text.

Only three of them can process all 2500 files I tried (which are just machine manuals from major manufacturers, so not highly weird shit) without hitting errors, let alone produce correct results.

About 10 of them have a failure rate of 5% or less on parsing the files (let alone extracting text). This is horrible.

It goes downhill fast from there.

I'm retired, so I have time to fuck around like this. But going into it, there is no way I would have expected these results, or had time to figure out which 3 libraries could actually be used.
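
For anyone wanting to run a similar comparison, here is a minimal sketch of a harness that tallies per-engine parse failures. It is not the setup used for the numbers above. The `failure_rates` function and the idea of wrapping each engine in a `parse(path)` callable that raises on failure are assumptions made for illustration, and it only counts crashes, not whether the extracted text is correct.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def failure_rates(
    engines: Dict[str, Callable[[Path], str]],
    files: Iterable[Path],
) -> Dict[str, float]:
    """Return, per engine, the fraction of files it failed to parse.

    `engines` maps a label to a hypothetical wrapper that takes a path,
    returns extracted text, and raises an exception on any parse error.
    """
    files = list(files)
    if not files:
        raise ValueError("no files to test")

    rates = {}
    for name, parse in engines.items():
        failures = 0
        for path in files:
            try:
                parse(path)
            except Exception:
                # Any exception counts as a parse failure for this file.
                failures += 1
        rates[name] = failures / len(files)
    return rates
```

On a 2500-file corpus, a 5% rate from this function would mean 125 files that raised an error, and only an engine with a rate of exactly 0.0 would count as processing everything.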