> This is exactly the reason why Computer Vision approaches for parsing PDFs work so well in the real world.
One of the biggest benefits of PDFs though is that they can contain invisible data. E.g. the spec allows me to embed cryptographic proof that I've worked at the companies I claim to have worked at within my resume. But a vision-based approach obviously isn't going to be able to capture that.
Cryptographic proof of job experience? Please explain more. Sounds interesting.
If someone told me there was cryptographic proof of job experience in their PDF, I would probably just believe them because it’d be a weird thing to lie about.
In theory your (old) boss could sign part of your CV with a certificate obtained from any CA participating in Adobe's AATL programme. If you use the software right, you could have different ranges signed by different people/companies. Because only a small component gets signed, you'd need them to sign text saying "Jane Doe worked at X corp and did their job well", because a signed line that just says "software developer" can be yanked out and placed into other PDF documents (simplifying a little here).
I'm not sure if there's software out there to make that process easy, but the format allows for it. The format also allows for someone to produce and sign one version and someone else to adjust that version and sign the new changes.
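To make the "sign one version, then someone else signs the changes" part a bit more concrete, here is a rough sketch, not a verifier. Each signature dictionary carries a `/ByteRange` array of four integers (two offset/length pairs) describing which bytes of the file were hashed; the gap between the two ranges is where the signature value itself sits. The function name `signature_coverage` is made up for illustration, and the sketch assumes the signature dictionary is stored uncompressed in the file. It checks no cryptography at all.

```python
import re

# Matches "/ByteRange [ a b c d ]": two (offset, length) pairs of signed bytes.
BYTE_RANGE = re.compile(rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]")


def signature_coverage(pdf_bytes: bytes) -> list[dict]:
    """Report the signed byte ranges of each signature found in the raw file.

    Hypothetical helper: it only reads the /ByteRange arrays and does not
    validate any signature or certificate.
    """
    results = []
    for match in BYTE_RANGE.finditer(pdf_bytes):
        off1, len1, off2, len2 = (int(g) for g in match.groups())
        results.append({
            "ranges": [(off1, len1), (off2, len2)],
            # Bytes between the two ranges hold the signature value and are
            # excluded from the digest.
            "gap": (off1 + len1, off2),
            # False means bytes were appended after this signature was made,
            # for example by a later incremental update or second signature.
            "covers_whole_file": off2 + len2 == len(pdf_bytes),
        })
    return results
```

Running it over `open("cv.pdf", "rb").read()` (file name made up) would list one entry per signature. An earlier signature that no longer reaches the end of the file is what you'd expect to see when someone else changed the document and signed again.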
Funnily enough, the PDF signature actually has a field to refer to a (picture of a) readable signature in the file, so software can jot down a scan of a signature that automatically inserts cryptographic proof.
In practice I've never seen PDFs signed with more than one signature. PDF readers from anyone but Adobe seem to completely ignore signatures unless you manually open the document properties, but Adobe Reader will show you a banner saying "document signed by XYZ" when you open a signed document.
Encrypted (and hidden) embedded information, e.g. documents, signatures, certificates, watermarks, and the like. To (legally binding) standards, e.g. for notary use, et cetera.
What software can be used to write and read this invisible data? I want to document continuous edits to published documents which cannot show these edits until they are reviewed, compiled and revised. I was looking at doing this in Word, but we keep Word and PDF versions of these documents.
If that stuff is stored as structured metadata, extracting it should be trivial.
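As a rough sketch of the easy case: document-level XMP metadata is usually stored as plain XML wrapped in `<?xpacket ...?>` markers, so a byte scan can find it without a PDF parser. This assumes the metadata stream is uncompressed, unencrypted and UTF-8, and `extract_xmp_packets` is just an illustrative name. Signatures and embedded files would still need a real PDF library.

```python
def extract_xmp_packets(pdf_bytes: bytes) -> list[str]:
    """Return every XMP packet found by scanning for its <?xpacket ...?> wrapper.

    Hypothetical helper: works only when the metadata stream is stored
    uncompressed and unencrypted.
    """
    packets = []
    pos = 0
    while True:
        start = pdf_bytes.find(b"<?xpacket begin=", pos)
        if start == -1:
            break
        end_marker = pdf_bytes.find(b"<?xpacket end=", start)
        if end_marker == -1:
            break
        end = pdf_bytes.find(b"?>", end_marker)
        if end == -1:
            break
        end += 2  # include the closing "?>" of the end marker
        packets.append(pdf_bytes[start:end].decode("utf-8", errors="replace"))
        pos = end
    return packets
```

Each returned string is an XML document that can be handed to any XML parser. If the list comes back empty, that only means nothing was stored in this plain form, not that the file has no metadata.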
Yeah we don't handle this yet.