Hacker News

I've been pondering for a while that we need to move away from layout-based written communication. As in, the need to make things look professionally laid out is an anachronism, and is (very) rarely related to comprehension of the actual content.

For example, submissions to regulatory agencies are huge documents; we spend lots of time in (typically) Microsoft Word creating documents that follow a layout tradition. Aside from this time spent (wasted), the downside is that to guarantee that layout for the recipient, the file must be submitted in DOCX or PDF. These formats are then unfriendly if you want to do anything programmatically with them, extract raw data, etc. And of course, while LLMs can read such files, there's likely a significant computational overhead vs. a file in a simple machine-readable format (e.g. text, markdown, XML, JSON).

---

An alternative approach would be to adopt a very simple 'machine first', or 'content first' format - for example, based on JSON, XML, even HTML - with minimal metadata to support structure, intra-document links, and embedding of images. For human consumption, a simple viewer app would reconstitute the file into something more readable; for machine consumption, the content is already directly available. I'm well aware that such formats already exist - HTML/browsers, or EPUB/readers, for example - the issue is to take the rational step towards adopting such a format in place of the legacy alternatives.
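To make the idea concrete, here is a minimal sketch of what such a 'content first' file and its viewer might look like, assuming a JSON-based format. The field names (`blocks`, `type`, `id`, `ref`, `src`) and the sample content are hypothetical, invented purely for illustration, and do not correspond to any existing standard.

```python
import json

# Hypothetical content-first document. The schema ("blocks", "type", "id",
# "ref", "src") is an illustrative assumption, not an existing standard.
DOC = json.loads("""
{
  "title": "Stability Report",
  "blocks": [
    {"type": "heading", "id": "summary", "text": "Summary"},
    {"type": "paragraph", "text": "All batches met specification."},
    {"type": "image", "id": "fig1", "src": "fig1.png", "alt": "Assay over time"},
    {"type": "link", "ref": "fig1", "text": "See the assay plot"}
  ]
}
""")


def render_text(doc):
    """Viewer side: reconstitute the document as readable plain text."""
    lines = [doc["title"].upper(), ""]
    for block in doc["blocks"]:
        kind = block["type"]
        if kind == "heading":
            lines += [block["text"], "-" * len(block["text"])]
        elif kind == "paragraph":
            lines.append(block["text"])
        elif kind == "image":
            lines.append(f"[image {block['id']}: {block['alt']}]")
        elif kind == "link":
            # Intra-document link: "ref" points at another block's "id".
            lines.append(f"{block['text']} (-> {block['ref']})")
    return "\n".join(lines)


def extract_paragraphs(doc):
    """Machine side: content is directly addressable, with no layout to undo."""
    return [b["text"] for b in doc["blocks"] if b["type"] == "paragraph"]


if __name__ == "__main__":
    print(render_text(DOC))
```

The point of the sketch is that the same file serves both audiences: `render_text` is a stand-in for the viewer app, while `extract_paragraphs` shows that a program needs no layout reverse-engineering to get at the content.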

I'm hoping that the LLM revolution will drive us in just this direction, and that in time, expensive parsing of PDFs will be a thing of the past.

xp84 3 days ago

I’m with you on PDF, but is docx really that bad in practice? I have not implemented a parser for it so I’m not pushing one answer to that. But it seems like it’s an XML-based format that isn’t about absolutely positioning everything unless you explicitly decide to, and intuitively it seems like it should be like an 80 on the parsing easiness scale if a JPEG is a 0, a PDF is a 15, and a markdown is 100.

grues-dinner 3 days ago

The docx standard, which was rather tendentiously named Office Open XML back when OpenOffice was still called that, is 5,000 pages long, and that's only Part 1 of ECMA-376, with another 1,500 pages of "Transitional OOXML" in Part 4, which is basically Word-specific quirks.

Anon_troll 3 days ago

Extracting text from DOCX is easy. Anything related to layout is non-trivial and extremely brittle.
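As a rough illustration of the "extracting text is easy" half, here is a minimal sketch using only the Python standard library. It relies on a .docx file being a ZIP archive whose main body lives in `word/document.xml`, with paragraphs as `w:p` elements and text runs as `w:t` elements. The function names are my own, and the sketch deliberately ignores headers, footnotes, and text boxes, and knows nothing about layout.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(file):
    """Return the text of each paragraph (w:p) in a .docx path or file object."""
    with zipfile.ZipFile(file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # Concatenate every text run (w:t) found inside the paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs


def make_minimal_docx(texts):
    """Build a tiny in-memory archive holding only word/document.xml.

    Demo helper only: this is enough for docx_paragraphs, but it is not a
    complete DOCX package that Word would necessarily open.
    """
    ns = W[1:-1]  # strip the surrounding braces to get the bare namespace URI
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{ns}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(docx_paragraphs(make_minimal_docx(["Hello", "World"])))
```

Nothing here says where a paragraph ends up on the page, which is exactly the part that is hard.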

To get the layout correct, you need to reverse engineer details down to Word's numerical accuracy so that content appears at the correct position in more complex cases. People like creating brittle documents where a pixel of difference can break the layout and cause content to misalign and appear on separate pages.

This will be a major problem for cases like the text saying "look at the above picture" but the picture was not anchored properly and floated to the next page due to rendering differences compared to a specific version of Word.

Zardoz84 3 days ago

Docx is a proprietary format, so it's a direct no.

pointlessone 3 days ago

PDF doesn’t have to be bad. Tagged PDF can represent document structure with a decent variety of elements, including alternative text for objects. Proper text encoding can give a good representation of all the ligatures and such. All of this has been part of the spec since 2001. The fact that modern software produces PDFs that are barely any better than a series of vector images is totally on the producers of that software.

phaistra 2 days ago

Sounds like you are describing markdown.