I'm debugging a PDF parsing issue as we speak, and I've actually started writing my own parser (partly to understand the issue, partly as a last resort, since the code in the parser I was debugging felt a bit shoddy).

The PDF format is frankly quite horrible, extended over the years by kludges that feel like premature optimizations in some cases and bloated overkill in others.

While theoretically a nice idea, the issue is that there are just so many damn object types with specialized properties inside a PDF that you'd basically end up with all the complications of an FFI for each binding you'd write to expose a sane subset.

Theoretically one could perhaps make a canonical PDF<->JSON or similar mapping from an established library that most PDF data consumers/generators could use if memory usage isn't too constrained (because the underlying object model isn't entirely dissimilar).
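
To make the idea concrete, here is a minimal sketch in Python of what such a mapping might look like. Everything in it is made up for illustration: the `Name`, `Ref` and `Stream` classes and the `$name`/`$ref`/`$stream`/`$dict` tags are assumptions, not the format of any existing library or tool. It also simplifies PDF strings to plain Python `str`, which a real mapping could not do since PDF strings are byte strings.

```python
import base64
import json
from dataclasses import dataclass


# Hypothetical in-memory stand-ins for the PDF object types that have
# no direct JSON equivalent. Not taken from any real library.
@dataclass
class Name:
    value: str          # e.g. Name("Page") for /Page


@dataclass
class Ref:
    num: int            # object number
    gen: int = 0        # generation number


@dataclass
class Stream:
    dictionary: dict    # stream dictionary
    data: bytes         # raw stream bytes


def to_json(obj):
    """Convert a PDF-like object into JSON-serializable data."""
    # null, booleans, numbers and (simplified) strings map directly.
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, Name):
        return {"$name": obj.value}
    if isinstance(obj, Ref):
        return {"$ref": [obj.num, obj.gen]}
    if isinstance(obj, Stream):
        return {"$stream": {
            "dict": to_json(obj.dictionary),
            "data": base64.b64encode(obj.data).decode("ascii"),
        }}
    if isinstance(obj, list):
        return [to_json(item) for item in obj]
    if isinstance(obj, dict):
        # Wrapped so a PDF dictionary can never be mistaken for a tag.
        return {"$dict": {key: to_json(val) for key, val in obj.items()}}
    raise TypeError(f"unsupported object: {obj!r}")


def from_json(data):
    """Inverse of to_json."""
    if data is None or isinstance(data, (bool, int, float, str)):
        return data
    if isinstance(data, list):
        return [from_json(item) for item in data]
    if isinstance(data, dict):
        if "$name" in data:
            return Name(data["$name"])
        if "$ref" in data:
            num, gen = data["$ref"]
            return Ref(num, gen)
        if "$stream" in data:
            inner = data["$stream"]
            return Stream(from_json(inner["dict"]),
                          base64.b64decode(inner["data"]))
        if "$dict" in data:
            return {key: from_json(val)
                    for key, val in data["$dict"].items()}
    raise TypeError(f"unsupported JSON value: {data!r}")


# A page-like dictionary, round-tripped through JSON text.
page = {
    "Type": Name("Page"),
    "Parent": Ref(2),
    "MediaBox": [0, 0, 612, 792],
    "Contents": Stream({"Length": 4}, b"q Q\n"),
}
text = json.dumps(to_json(page))
restored = from_json(json.loads(text))
```

The explicit tags are the whole design choice here: JSON has no names, indirect references or streams, so each needs some unambiguous encoding, and the memory cost mentioned above comes largely from holding every stream as base64 text.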

You can do:

  cpdf -output-json in.pdf -o out.json

(Modify out.json as you like.)

  cpdf -j out.json -o out.pdf

(Disclaimer: I wrote it.)