Hacker News

I’m with you on PDF, but is docx really that bad in practice? I have not implemented a parser for it, so I’m not committing to an answer. But it seems like it’s an XML-based format that doesn’t absolutely position everything unless you explicitly decide to, and intuitively it seems like it should be about an 80 on the parsing easiness scale if a JPEG is a 0, a PDF is a 15, and Markdown is a 100.

grues-dinner 3 days ago

The docx standard, which was rather tendentiously named Office Open XML back when OpenOffice was still called that, is 5000 pages long, and that's only Part 1 of ECMA-376, with another 1500 pages of "Transitional OOXML" in Part 4, which is basically Word-specific quirks.

Anon_troll 3 days ago

Extracting text from DOCX is easy. Anything related to layout is non-trivial and extremely brittle.
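For a sense of how little the text-only case needs, here is a minimal sketch using only the Python standard library. It assumes the main document part lives at `word/document.xml` (the usual location, though strictly it is whatever the package relationships point to) and that the file uses the Transitional WordprocessingML namespace. The function name `docx_paragraphs` is made up for illustration. It ignores headers, footers, footnotes, and anything nested inside a paragraph such as text boxes, and says nothing about layout.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Transitional WordprocessingML namespace (assumption: Strict files use a different URI).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each <w:p> paragraph in the main document part.

    `source` is a path or a binary file-like object holding a .docx file.
    """
    # A .docx file is a ZIP archive; the body is assumed to be word/document.xml.
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Text is split across runs (<w:r>), each holding <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs
```

Calling `docx_paragraphs("report.docx")` (a hypothetical file name) returns one string per paragraph, in document order.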

To get the layout correct, you need to reverse engineer details down to Word's numerical accuracy so that content appears at the correct position in more complex cases. People like creating brittle documents where a pixel of difference can break the layout and cause content to misalign and appear on separate pages.

This will be a major problem for cases like the text saying "look at the above picture" but the picture was not anchored properly and floated to the next page due to rendering differences compared to a specific version of Word.

Zardoz84 3 days ago

Docx is a proprietary format, so it's a direct no.