Hacker News

1. It's extra work to add an annotation or "internal data format" inside the PDF.

2. By the time the PDF is generated in a real system, the original data source and its meaning may be very far upstream in the data pipeline. Preserving them may require incredible cross-team and/or cross-vendor cooperation.

3. Chicken and egg. There are very few if any machine parseable PDFs out there, so there is little demand for such.

I'm actually much more optimistic about embedding metadata "in-band" with the human-readable data, such as a dense QR code or similar.

dotancohen 3 days ago

  > Chicken and egg. There are very few if any machine parseable PDFs out there, so there is little demand for such.

No, the egg has been laid for quite some time. There's just not enough chicken. Almost every place I've worked at has complained about the parsability of PDF files until I showed them LibreOffice's PDF export feature, which supports PDF/A (archivable), PDF/UA (Universal Accessibility), and embedding the original .odt file in the PDF itself. That combo format has saved so many people so much headache that I don't know why it is not more widely known.

pbronez 3 days ago

That is a really interesting idea. Did some napkin math:

Consumer printers can reliably handle 300 Dots Per Inch (DPI). Standard letter paper is 8.5” x 11” and we need 0.5” margins on all sides to be safe. This gives you a 7.5” x 10” printable area, which is 2250 x 3000 dots. Assume 1 dot = 1 QR code module (cell) and we can pack 432 Version 26 QR codes onto the page (18 x 24; 121 modules per side; 4 modules of quiet space between them).

A version 26 QR code can store 864 to 1,990 alphanumeric characters depending on error correction level. That’s 373,248 to 859,680 characters per page! Probably need maximum error correction to have any chance of this working.

If we use 4 dots per module side (a 4 x 4 block of dots per module), we drop down to 48 Version 18 QR codes (6 x 8). Those can hold 452 to 1,046 alphanumeric characters each, for 21,696 to 50,208 characters per page.
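Here is a rough sketch of that packing arithmetic in Python, in case anyone wants to play with the numbers. It assumes the standard QR side length of 17 + 4 x version modules and one 4-module gap per code, as in the estimate above. The function names and the CAPACITY table are just illustrative, with the capacity figures copied from this comment rather than looked up programmatically.

```python
def qr_modules_per_side(version: int) -> int:
    """Side length of a QR symbol in modules: 17 + 4 * version."""
    return 17 + 4 * version


def codes_per_page(version, dots_per_module, width_in=7.5, height_in=10.0,
                   dpi=300, gap_modules=4):
    """Return (columns, rows, total) of QR codes that fit the printable area.

    Assumption: each code occupies its own modules plus one gap of
    `gap_modules`, so neighbouring codes share the quiet space between them.
    """
    pitch_dots = (qr_modules_per_side(version) + gap_modules) * dots_per_module
    cols = int(width_in * dpi) // pitch_dots
    rows = int(height_in * dpi) // pitch_dots
    return cols, rows, cols * rows


# Alphanumeric capacity per code at (highest, lowest) error correction,
# taken from the figures quoted in the comment.
CAPACITY = {26: (864, 1990), 18: (452, 1046)}

for version, dots_per_module in ((26, 1), (18, 4)):
    cols, rows, total = codes_per_page(version, dots_per_module)
    low, high = CAPACITY[version]
    print(f"Version {version}, {dots_per_module} dot(s) per module side: "
          f"{cols} x {rows} = {total} codes, "
          f"{total * low:,} to {total * high:,} characters per page")
```

Changing `dpi` or `dots_per_module` is the quickest way to see how fast capacity falls off once you leave margin for real-world printing and scanning.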

Compare that to around 5,000 characters per page of typed English. You can conservatively get 4x the information density with QR codes.

Conclusion: you can add a machine-readable appendix to your text-only PDF file at a cost of increasing page count by about 25%.
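As a sanity check on that 25% figure, here is a minimal sketch of the overhead calculation. It assumes 5,000 characters per text page and the conservative case of 48 Version 18 codes at 452 characters each; `appendix_pages` is a made-up helper name, and the whole thing ignores any framing or compression of the payload.

```python
CHARS_PER_TEXT_PAGE = 5_000        # rough figure for typed English
CHARS_PER_QR_PAGE = 48 * 452       # 48 Version 18 codes at maximum error correction


def appendix_pages(text_pages: int,
                   chars_per_qr_page: int = CHARS_PER_QR_PAGE) -> int:
    """Pages of QR codes needed to carry the text of `text_pages` pages."""
    total_chars = text_pages * CHARS_PER_TEXT_PAGE
    # Integer ceiling division, so a partly filled last page still counts.
    return -(-total_chars // chars_per_qr_page)


for pages in (4, 20, 100):
    extra = appendix_pages(pages)
    print(f"{pages} text pages -> {extra} QR pages "
          f"({100 * extra / pages:.0f}% more pages)")
```

Short documents pay proportionally more because the last QR page is rarely full, so the 25% figure is really the long-document limit.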

actionfromafar 3 days ago

Also... many PDFs today are not intended to ever meet a dead tree. If that's the case you can put pretty high DPI QR codes there without issue.

pbronez 2 days ago

Hmm you could do a bunch of crazy stuff if you assume it will stay digital.

You could have an arbitrarily large page size. You could use color to encode more data… maybe stack QR codes using each channel of a color space (3 for RGB, 4 for CMYK)
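To make the channel-stacking idea concrete, here is a toy sketch in plain Python. It does not generate real QR codes: the three matrices stand in for the module grids of three already-encoded symbols, and `stack_channels` / `split_channels` are hypothetical names. It only shows that three binary layers can share one RGB image and be separated again by thresholding each channel, assuming the image stays digital and the colors are not distorted.

```python
from typing import List, Tuple

Matrix = List[List[int]]        # 1 = dark module, 0 = light module
Pixel = Tuple[int, int, int]    # (R, G, B), each in 0..255


def stack_channels(red: Matrix, green: Matrix, blue: Matrix) -> List[List[Pixel]]:
    """Overlay three equal-sized module grids, one per RGB channel."""
    return [
        [tuple(0 if bit else 255 for bit in bits)
         for bits in zip(r_row, g_row, b_row)]
        for r_row, g_row, b_row in zip(red, green, blue)
    ]


def split_channels(image: List[List[Pixel]],
                   threshold: int = 128) -> Tuple[Matrix, Matrix, Matrix]:
    """Recover the three module grids by thresholding each channel."""
    layers = [
        [[1 if pixel[channel] < threshold else 0 for pixel in row]
         for row in image]
        for channel in range(3)
    ]
    return layers[0], layers[1], layers[2]


# Placeholder 2 x 3 grids standing in for three separate QR symbols.
a = [[1, 0, 1], [0, 1, 0]]
b = [[0, 0, 1], [1, 1, 0]]
c = [[1, 1, 1], [0, 0, 0]]

image = stack_channels(a, b, c)
assert split_channels(image) == (a, b, c)
```

Each recovered layer would then go to an ordinary QR decoder. A CMYK version would add a fourth layer, but the round trip through a printer and camera is where this would most likely break down.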

There are interesting accessibility and interoperability trade-offs. If it’s print-ready with embedded metadata, you can recover the data from a printed page with any smartphone. If it’s a 1 inch by 20 ft digital page of CMYK-stacked QR codes, you’ll need some custom code.

Playing “Where’s Waldo” with a huge field of QR codes is probably still way more tractable than handling PDF directly though!