Hacker News

There's a huge difference between parsing a PDF and parsing the contents of a PDF. Parsing PDF files is its own hell, but because PDFs are basically "stuff at a given position" and often not "well-formed text within boundary boxes", you have to guess what letters belong together if you want to parse the text as a word.

If you're interested in helping out the resume parsers, take a look at the accessibility tree. Not every PDF renderer generates accessible PDFs, but accessible PDFs can help shitty AI parsers get their names right.

As for the ff problem, that's probably the resume analyzer not being able to cope with non-ASCII text such as the ﬀ ligature. You may be able to influence the PDF renderer not to generate ligatures like that (at the expense of often creating uglier text).