It might sound absurd, but on paper this should be the best way to approach the problem.
My understanding is that PDFs are intended to produce output consumed by humans, not computers: the format focuses on how to display data so that a human can (hopefully) read it easily. The technique here mimics the human approach, which would seem to make sense.
It is sad, though, that in 30+ years we didn't manage to add a consistent way to make a PDF readable by a machine. I wonder what incentives were missing that kept this from happening. Does anyone have some insight here?
> It might sound absurd, but on paper this should be the best way to approach the problem.
On paper yes, but for electronic documents? ;)
More seriously: PDF supports all the necessary features, like structure tags. You can create a PDF with basically the same structural information as an HTML document. The problem is that most PDF-generating workflows don’t bother, because it requires care and extra work.
And yes, PDF was originally created as an input format for printing. The “portable” in “PDF” refers to the fact that, unlike PostScript files of the time (1980s), they are not tied to a specific printer make or model.
Probably for the same reason images were not readable by machines.
Except PDFs dangle hope of maybe being machine-readable because they can contain unicode text, while images don't offer this hope.
1. It's extra work to add an annotation or "internal data format" inside the PDF.
2. By the time the PDF is generated in a real system, the original data source and its meaning may be far upstream in the data pipeline. Preserving them may require considerable cross-team and/or cross-vendor cooperation.
3. Chicken and egg: there are very few (if any) machine-parseable PDFs out there, so there is little demand for tooling that consumes them.
I'm actually much more optimistic about embedding metadata "in-band" with the human-readable data, such as a dense QR code or similar.
That is a really interesting idea. Did some napkin math:
Consumer printers can reliably handle 300 dots per inch (DPI). Standard letter paper is 8.5” x 11”, and we need a 0.5” margin on all sides to be safe. This gives you a 7.5” x 10” printable area, which is 2250 x 3000 dots. Assume 1 dot = 1 QR code module (cell), and we can pack 432 Version 26 QR codes onto the page (121 modules per side, plus a 4-module quiet-zone buffer between them).
A version 26 QR code can store 864 to 1,990 alphanumeric characters depending on error correction level. That’s 373,248 to 859,680 characters per page! Probably need maximum error correction to have any chance of this working.
If we use 4 dots per module, we drop down to 48 Version 18 QR codes (6 x 8). Those can hold 452-1,046 alphanumeric characters each, for 21,696-50,208 characters per page.
Compare that to around 5,000 characters per page of typed English: you conservatively get 4x the information density with QR codes.
Conclusion: you can add a machine-readable appendix to your text-only PDF file at a cost of increasing page count by about 25%.
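For anyone who wants to check or tweak the arithmetic above, here's a small Python sketch of the same napkin math. The capacity table simply hard-codes the standard alphanumeric capacities quoted above (error-correction levels H and L); page size, DPI, and margins are parameters you can change.

```python
# Napkin math: how many QR codes fit on a US-letter page at a given DPI?
# QR geometry: version v has (17 + 4*v) modules per side, and each symbol
# needs a 4-module quiet zone, shared with its neighbor.

def modules_per_side(version):
    return 17 + 4 * version

def codes_per_page(version, dots_per_module, dpi=300,
                   page=(8.5, 11.0), margin=0.5):
    printable_w = (page[0] - 2 * margin) * dpi   # 7.5" -> 2250 dots
    printable_h = (page[1] - 2 * margin) * dpi   # 10"  -> 3000 dots
    # Each symbol plus its shared quiet zone occupies (side + 4) modules.
    pitch = (modules_per_side(version) + 4) * dots_per_module
    return int(printable_w // pitch) * int(printable_h // pitch)

# Alphanumeric capacities (chars) at EC level H (max) and L (min correction).
CAPACITY = {26: (864, 1990), 18: (452, 1046)}

for version, dots in [(26, 1), (18, 4)]:
    n = codes_per_page(version, dots)
    lo, hi = CAPACITY[version]
    print(f"v{version} at {dots} dot(s)/module: {n} codes, "
          f"{n * lo:,}-{n * hi:,} chars/page")
```

Running it reproduces the figures above: 432 Version 26 codes (373,248-859,680 chars) at 1 dot per module, and 48 Version 18 codes (21,696-50,208 chars) at 4 dots per module.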
Also... many PDFs today are not intended to ever meet a dead tree. If that's the case you can put pretty high DPI QR codes there without issue.
Hmm you could do a bunch of crazy stuff if you assume it will stay digital.
You could have an arbitrarily large page size. You could use color to encode more data… maybe stack QR codes using each channel of a color space (3 for RGB, 4 for CMYK)
There are interesting accessibility and interoperability trade-offs. If it’s print-ready with embedded metadata, you can recover the data from a printed page with any smartphone. If it’s a 1-inch-by-20-foot digital page of CMYK-stacked QR codes, you’ll need some custom code.
Playing “Where’s Waldo” with a huge field of QR codes is probably still way more tractable than handling PDF directly though!
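The channel-stacking idea above can be sketched in plain Python. This is only a toy: random bit matrices stand in for real QR symbols (a real implementation would generate those with a QR library), but it shows the core trick of merging three bit planes into one RGB image and recovering them by thresholding each channel independently.

```python
# Toy sketch of stacking three QR-like bit planes into the R, G, B channels
# of a single color image, then recovering them losslessly.
import random

def stack_rgb(planes):
    """Merge three equal-sized 0/1 matrices into (r, g, b) pixels.
    Bit 1 = dark module = low channel value (0); bit 0 = light (255)."""
    h, w = len(planes[0]), len(planes[0][0])
    return [[tuple(0 if plane[y][x] else 255 for plane in planes)
             for x in range(w)] for y in range(h)]

def unstack_rgb(image):
    """Recover the three bit planes by thresholding each channel."""
    return [[[1 if px[c] < 128 else 0 for px in row] for row in image]
            for c in range(3)]

random.seed(0)
# Three random 21x21 bit matrices stand in for three Version 1 QR symbols.
planes = [[[random.randint(0, 1) for _ in range(21)] for _ in range(21)]
          for _ in range(3)]
image = stack_rgb(planes)
assert unstack_rgb(image) == planes  # round trip is lossless in the digital case
```

The round trip only works this cleanly because the image stays digital; a printed-and-scanned version would need real color calibration, which is exactly the interoperability cost discussed above.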
Yes, PDFs are primarily a way to describe print data, so to a certain extent the essence of PDF is a hybrid vector-raster image format. Sure, these days text is almost always encoded as, or overlaid with, actual machine-readable text, but this isn't strictly necessary and wasn't always done, especially in older PDFs. Fifteen years ago you couldn't copy legible text out of most PDFs made with LaTeX.
> the format seems to be focused on how to display some data so that a human can (hopefully) easily read them
It may seem so, but what it really focuses on is how to arrange stuff on a page that has to be printed. Literally everything else, from forms to hyperlinks, was a later addition (and it shows, given the crater-sized security holes they punched into the format).
It's Portable Document Format, and the Document refers to paper documents, not computer files.
In other words, this is a way to get a paper document into a computer.
That's why half of them are just images: they came out of scanners. Sometimes the images carry an OCR text layer, so you can select the text, and when you copy and paste it, it's wrong.