Hacker News

anitil 20 hours ago

My go-to for this is to take a screenshot and use the built-in text extraction in the screenshot tool (I'm on a Mac), then pass that text on to whatever processing I need. It works pretty well so long as the PDF is in OK shape (I've had errors with scanned images).

nradov 20 hours ago

It's so horrible that in 2026 people are still publishing important data and specifications in a format like PDF that's difficult for LLMs to consume. We need to drag them kicking and screaming to HTML or Markdown. Heck, even Microsoft Word DOCX is superior for reliable parsing and content extraction.
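To make the DOCX point concrete, here is a rough sketch (not a full parser) of pulling paragraph text out of a .docx with nothing but the Python standard library. It relies on the fact that a DOCX file is a ZIP archive whose main body lives in `word/document.xml`; the function name `docx_paragraphs` is made up for illustration, and it only reads the main body, so headers, footnotes, and table structure are ignored.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a DOCX file.

    `source` is a file path or a binary file-like object.
    (Hypothetical helper for illustration, not part of any library.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> runs.
    for p in root.iter(f"{{{W_NS}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Usage would be something like `docx_paragraphs("spec.docx")`, where `spec.docx` is a placeholder path. Getting equally clean paragraphs out of a PDF generally takes a dedicated library plus layout heuristics.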

dannyw 12 hours ago

Good luck. Getting rid of PDFs is going to be as hard as migrating away from JPEG everywhere.