Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. E.g., equations are extracted as LaTeX, and images (plots, figures, and so on) are described within `<img>` tags. The same goes for `<signature>`, `<watermark>`, and `<page_number>`.

Also, we extract complex tables as HTML tables instead of markdown.
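Roughly, a single page might come out like this (an illustrative sketch, not the model's exact output):

```markdown
# Quarterly Report

The growth rate is computed as $r = \frac{V_f - V_i}{V_i}$.

<img>Bar chart comparing Q1-Q4 revenue across three regions.</img>

<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.5M</td></tr>
</table>

<page_number>3</page_number>
```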

Have you considered XML? TEI, for example, is very robust and mature for marking up documents.

First I heard of it. https://en.wikipedia.org/wiki/Text_Encoding_Initiative

Understandable. I work in academic publishing, and while the "XML is everywhere" crowd is graying, retiring, or even dying :( XML still remains an excellent option for document markup. Additionally, a lot of government data produced in the US and EU makes heavy use of XML technologies; I imagine those producers could be interested consumers of Nanonets-OCR. TEI could be a good choice, as well-tested, well-developed conversions exist from TEI to other popular, less structured formats.
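For anyone who hasn't seen TEI, a minimal document looks roughly like this (trimmed down; the header shows just the required skeleton):

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Scanned report</title></titleStmt>
      <publicationStmt><p>OCR output, unpublished</p></publicationStmt>
      <sourceDesc><p>Digitized from print</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Running text with a footnote<note place="foot" n="1">The footnote
        text, kept attached to its anchor.</note> marked up in place.</p>
      <figure>
        <figDesc>Bar chart comparing quarterly revenue.</figDesc>
      </figure>
    </body>
  </text>
</TEI>
```

Notes, figures, and page breaks (`<pb/>`) all have dedicated elements, which is exactly the kind of semantic tagging being discussed here.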

Do check out MyST Markdown (https://mystmd.org)! Academic publishing is one space where MyST is being used, for example at https://www.elementalmicroscopy.com/ via Curvenote.

(I'm a MyST contributor)

Do you know why MyST got traction instead of ReST, which seems to have had all the custom tagging and extensibility built in from the beginning?

MyST Markdown (the MD flavour, not the same-named Document Engine) was inspired by ReST. It was created to address the main pain point of ReST for incoming users (it's not Markdown!).
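To make that concrete, here's the same admonition in both (illustrative):

````text
ReST:

.. note::
   Directives use indentation-sensitive syntax.

MyST Markdown:

```{note}
The same directive, written as a fenced block that feels native to Markdown users.
```
````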

As a project, the tooling to parse MyST Markdown was built on top of Sphinx, which primarily expects ReST as input. Now, I would not be surprised if most _new_ Sphinx users are using MyST Markdown (but I have no data there!).

Subsequently, the Jupyter Book project that built those tools has pivoted to building a new document engine that's better focused on the use-cases of our audience and leaning into modern tooling.

Maybe even EPUB, which is XHTML.

Yeah, this really hurts. If your goal is to precisely mark up a document with some structural elements, XML is strictly superior to Markdown.

The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.

I think the choice mainly stems from how you want to use the output. If the output is going to be fed to another LLM, then you want a markup language 1) whose grammar would not cause too many issues with tokenization, 2) which the LLM has seen a lot of in the past, and 3) which generates a minimal number of tokens. I think markdown fits these criteria much better than other markup languages.

If the goal is to parse this output programmatically, then I agree a more structured markup language is the better choice.
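As a rough illustration of the token-count point, here's a quick sketch using tiktoken (the sample snippets are made up, and the numbers will vary by tokenizer and formatting):

```python
# Compare token counts for the same content in markdown vs. XML.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: a GPT-4-era encoding

md = "## Results\n\n| Metric | Value |\n| --- | --- |\n| F1 | 0.92 |\n"
xml = (
    "<section><title>Results</title>"
    "<table><row><cell>Metric</cell><cell>Value</cell></row>"
    "<row><cell>F1</cell><cell>0.92</cell></row></table></section>"
)

for label, text in [("markdown", md), ("xml", xml)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```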

What happens to footnotes?

They will be extracted on a new line as normal text. It will be the last line.

So I’m left to manually link them up?

Have you considered using something like Pandoc’s method of marking them up? Footnotes are a fairly common part of scanned pages, and markdown that doesn’t indicate that a footnote is a footnote can be fairly incomprehensible.
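For reference, Pandoc-style markdown footnotes look like this:

```markdown
Here is a claim that needs a source.[^1]

Pandoc also supports inline notes.^[Like this one.]

[^1]: Smith (2020), p. 42.
```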

I am lazily posting this all over the thread, but do check out MyST Markdown too! https://mystmd.org. We handle footnotes as structured objects.
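Simplified, a footnote definition becomes a node in the document AST, roughly like this (field names follow mdast conventions; the actual MyST schema may differ in detail):

```json
{
  "type": "footnoteDefinition",
  "identifier": "1",
  "children": [
    {
      "type": "paragraph",
      "children": [{ "type": "text", "value": "Smith (2020), p. 42." }]
    }
  ]
}
```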