Hacker News

I don't think any of those uses a ligature. Ü, é and Þ are distinct characters in legacy Latin-1 and in Unicode. It wouldn't surprise me if non-Scandinavian websites do not like Þ, however.
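
For anyone who wants to check this, here is a quick sketch using only Python's standard `unicodedata` module. The helper name `describe` is mine, and I added U+FB00 (the ff ligature) for contrast. It reports each character's Unicode name, whether it fits in Latin-1, and its decomposition mapping.

```python
import unicodedata

def describe(ch: str) -> tuple[str, bool, str]:
    """Return (Unicode name, encodable in Latin-1?, decomposition mapping)."""
    try:
        ch.encode("latin-1")
        in_latin1 = True
    except UnicodeEncodeError:
        in_latin1 = False
    return unicodedata.name(ch), in_latin1, unicodedata.decomposition(ch)

# U+00DC (U with diaeresis), U+00E9 (e with acute), U+00DE (thorn),
# and U+FB00 (the ff ligature) for comparison.
for ch in "\u00dc\u00e9\u00de\ufb00":
    print(f"U+{ord(ch):04X}", describe(ch))
```

The first three are single Latin-1 code points with ordinary letter names. Only U+FB00 is named as a ligature, falls outside Latin-1, and carries a `<compat>` decomposition to two plain f's.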

It's probably not PDF's fault that parsers are choking on the ff ligature. Changing all those parsers isn't practical, and Adobe can't make that happen.
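
If you control the consuming side, one possible workaround (my assumption, not something any particular parser does) is to run the extracted text through Unicode NFKC normalization, which expands compatibility ligatures into plain letters. A minimal sketch, with `fold_ligatures` as a made-up helper name:

```python
import unicodedata

def fold_ligatures(text: str) -> str:
    """Apply NFKC normalization, which expands compatibility ligatures
    such as U+FB00 (ff) and U+FB03 (ffi) into their plain letters."""
    return unicodedata.normalize("NFKC", text)

# Hypothetical text as it might come out of a PDF text extractor.
extracted = "e\ufb00ective o\ufb03ce"
print(fold_ligatures(extracted))
```

Note that NFKC also folds other compatibility characters (superscripts, full-width forms and so on), so it is a blunt tool if you need the original code points preserved.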

Finally, if you rely on metadata that isn't visible, you open yourself up to a different kind of problem, where what a visual inspection of the PDF shows differs from the parsed data. If I'm writing something to automatically classify PDFs from the wild, I want to use the visible data. A lot of tools (such as Paperless) will OCR a rasterized PDF to avoid these inconsistencies.