Full disclaimer: I work at Nanonets
Excited to share Nanonets-OCR-s, a powerful and lightweight 3B-parameter VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.). Key features:
LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
Smart Checkbox & Radio Button Handling: Converts checkboxes and radio buttons to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
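For anyone who wants to try it outside the Colab, here's a minimal sketch of running the model locally with Hugging Face transformers. It assumes a recent transformers release with the standard Qwen2.5-VL classes (the base model is Qwen2.5-VL-3B), and the prompt text is just illustrative, not the official prompt from the model card:

```python
# Minimal sketch: run Nanonets-OCR-s on a single page image with Hugging Face
# transformers. Assumes a recent transformers with Qwen2.5-VL support; the
# prompt text below is illustrative, not the model's official prompt.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "nanonets/Nanonets-OCR-s"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page_1.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this document page to structured Markdown."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens so only the generated Markdown remains.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
markdown = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(markdown)
```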
Correct link for Docext: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?
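One way that two-step pipeline could look, as a rough sketch: OCR the menu to Markdown first, then have a downstream text LLM emit JSON and validate it against a schema. The `call_llm` hook and the schema fields below are placeholders for whatever LLM and menu structure you actually use, not anything the model ships with:

```python
# Sketch of the two-step pipeline from the question: OCR the menu image to
# Markdown first, then have a downstream text LLM emit JSON that is validated
# against a schema. `call_llm` is a placeholder for any chat-completion call
# returning a string; the schema fields are illustrative only.
from pydantic import BaseModel, ValidationError

class MenuItem(BaseModel):
    name: str
    price: float
    description: str | None = None

class Menu(BaseModel):
    restaurant: str | None = None
    items: list[MenuItem]

def parse_menu(markdown: str, call_llm) -> Menu:
    prompt = (
        "Extract every menu item from the following Markdown and return ONLY JSON "
        'of the form {"restaurant": str | null, "items": [{"name": str, "price": float, '
        '"description": str | null}]}.\n\n' + markdown
    )
    raw = call_llm(prompt)
    try:
        return Menu.model_validate_json(raw)
    except ValidationError:
        # Retry / repair logic would go here; re-raise for the sketch.
        raise
```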
So it feels like this finally lets me do one thing I'd wanted for some time: scan printed documents and generate structured PDFs (rather than a PDF that's just a container for page images).
Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts, with text at varying angles, etc.
Have you found that accuracy improves or scales with larger models? Or are the improvements, if any, marginal compared to the 3B VLM?
Does it hallucinate, given that an LLM is being used?
Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.
Honestly, I was expecting the opposite: a repetition penalty kicking in after repeating zero too many times, resulting in too few zeros. But apparently not. So you might want to steer clear of this model if your document has a trillion pages.
Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.
[1] https://imgur.com/a/8rJeHf8
The base model is Qwen2.5-VL-3B and the announcement says a limitation is "Model can suffer from hallucination"
Seems a bit scary that the "source" text from the pdfs could actually be hallucinated.
Given that the input is an image and not raw PDF text, it's not completely unexpected.
Does it have a way to extract the images themselves, or is that still a separate process later?
If you're after extracting images from PDFs, there are plenty of tools that do that just fine without LLMs.
I mean, ideally it would be in context, so the generated markdown references the correct image at the correct location in the doc. Unless that's what you're talking about? In which case I don't know about those tools.
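For born-digital PDFs (not scans), something like PyMuPDF can pull out both the embedded images and their positions on the page, which you could use to splice image references back into the generated Markdown yourself. A rough sketch, assuming the standard PyMuPDF API:

```python
# Rough sketch: extract embedded images plus their page positions from a
# born-digital PDF with PyMuPDF, so they could be re-inserted at roughly the
# right spot in generated Markdown. Won't help with scanned pages, where the
# whole page is one big image.
import pymupdf  # pip install pymupdf (older versions import as `fitz`)

doc = pymupdf.open("input.pdf")
for page_index, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        info = doc.extract_image(xref)       # raw bytes + file extension
        rects = page.get_image_rects(xref)   # where the image appears on the page
        filename = f"page{page_index + 1}_img{img_index + 1}.{info['ext']}"
        with open(filename, "wb") as f:
            f.write(info["image"])
        print(filename, [tuple(round(v, 1) for v in r) for r in rects])
```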