Hacker News

If you don't have a known set of PDF producers this is really the only way to safely consume PDF content. Type 3 fonts alone make pulling text content out unreliable or impossible, before even getting to PDFs containing images of scans.

I expect the current LLMs significantly improve upon the previous ways of doing this, e.g. Tesseract, when given an image input? Is there any test you're aware of for model capabilities when it comes to ingesting PDFs?