PDF is more like a glorified SVG format than a Word format.
It only contains info on how the document should look, but no semantic information like sentences, paragraphs, etc. Just a bag of characters positioned in certain places.
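A minimal sketch of that with pdfminer.six ("sample.pdf" is a placeholder path): all the PDF gives back is individual glyphs with bounding boxes, and any notion of sentence or paragraph has to be reconstructed heuristically from the coordinates.

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

for page_layout in extract_pages("sample.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    # One character and the rectangle it was painted into.
                    print(repr(obj.get_text()), obj.bbox)
```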
Sometimes the characters aren’t even characters, just paths
Wouldn't that be very space inefficient, to repeat the paths every time a letter appears in the file? Or do you mean that glyph IDs don't necessarily map to Unicode?
Outlines are just a practical way of handling less common display cases.
Just to give a practical example: imagine a Star Wars advert that has the Star Wars logo at the top, specified in outlines because that's what every vector logo uses. Below it, the typical Star Wars intro text stretched into perspective, also in outlines, because that's the easiest (the display engine doesn't need a complicated transformation stack), most efficient to render (you have to render the outlines anyway), and most robust (it looks the same everywhere) way of implementing transformations on text. You also don't have to supply the font file, which comes with licensing issues, etc. And whenever compositing and transparency are involved, with all the color space conversion nonsense, it's more robust to "bake" the effect via constructive geometry operations to prevent display issues on other devices, which are surprisingly common.
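If you want to check whether a given PDF does this, one rough heuristic is to compare extractable text against vector drawing objects. A sketch with PyMuPDF ("advert.pdf" is a placeholder path):

```python
import fitz  # PyMuPDF

doc = fitz.open("advert.pdf")
for page in doc:
    chars = len(page.get_text().strip())   # text drawn via font glyphs
    paths = len(page.get_drawings())       # vector paths: lines, curves, fills
    print(f"page {page.number}: {chars} extractable chars, {paths} path objects")
    # Lots of path objects but almost no characters usually means the
    # visible "text" was baked into outlines and cannot be extracted as text.
```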
Sometimes in fancy articles you might see that the first letter is large and ornate; that's most likely a path. Also, like you said, glyph IDs don't necessarily map to Unicode, or the creator can intentionally mangle the ToUnicode map of an Identity-H embedded font in the PDF if they're nasty.
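A quick way to spot that problem is to list the fonts on each page and see whether they even carry a ToUnicode CMap; without one (or with a mangled one), the extractor has no reliable way to turn glyph IDs back into Unicode. A sketch with pypdf ("sample.pdf" is a placeholder path):

```python
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
for page_num, page in enumerate(reader.pages, start=1):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")
    if fonts is None:
        continue
    for name, ref in fonts.get_object().items():
        font = ref.get_object()
        status = "ToUnicode present" if "/ToUnicode" in font else "NO ToUnicode"
        print(page_num, name, font.get("/Subtype"), status)
```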
Yes, and don't for a second think this approach of rasterizing and OCR'ing is sane, let alone a reasonable choice. It is outright absurd.
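For concreteness, this is roughly what that fallback looks like: render each page to a bitmap and run Tesseract over it, ignoring whatever text layer the PDF may have. A sketch with PyMuPDF and pytesseract ("scan_me.pdf" and the 300 dpi figure are placeholders):

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan_me.pdf")
for page in doc:
    # PDF user space is 72 units per inch, so scale by 300/72 for ~300 dpi.
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    print(f"--- page {page.number} ---")
    print(pytesseract.image_to_string(img))
```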
No one has claimed that getting structured data out of PDFs is sane. What you seem to be missing is that there are no sane ways to get a decent output. The reasonable choice would be to not even try, but business needs invalidate that choice. So what remains are the absurd ways of solving the problem.
Well, perhaps you are only exposed to special snowflake PDFs that come from a single source and are somewhat well formed and easy to extract from. Others, like me, are working at companies that also have lots of PDFs, from many, many different sources, and there is no easy way to extract structured data, or even text, in a way that always works.
If you actually read what I have been saying and commenting, you would realise how silly your comment is.