Hacker News

Because PDF is as much a vector graphics format as a document format, you cannot expect the data to be reasonably structured. For example, applications can convert text to vector outlines or bitmaps for practical or artistic purposes (anyone who has ever had to deal with transparency "flattening" issues knows the pain). Ideally they also encode the text in a separate semantic representation, but many PDF files are exported from "image-centric" programs with image-centric workflows (e.g. Illustrator, CorelDRAW, InDesign, QuarkXPress) where the problem being solved is presentation, not semantics.

For example, if I receive a Word document and need to lay it out so it fits into my multi-column magazine layout, I will take the source text and break it into separate sections, which then get manually copied and pasted into InDesign. You can import the document directly, but for all kinds of practical reasons this is not the default way of working. Some asides and lists might be broken out of the original flow of text and placed in their own text field, etc. So now you have lost the original semantic structure.

Remember, this is how desktop publishing evolved: for print, which has no notion of structure or metadata embedded into the ink or paper. Another common use case is simply to have resolution-independent graphics: again, display purposes only, with no structured data required or expected.