Full disclaimer: I work at Nanonets
Excited to share Nanonets-OCR-s, a powerful and lightweight 3B-parameter VLM that converts documents into clean, structured Markdown. The model is trained to understand document structure and content context (tables, equations, images, plots, watermarks, checkboxes, etc.). Key features:
LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
Smart Checkbox & Radio Button Handling: Converts checkboxes and radio buttons to Unicode symbols like ☐, ☑, and ☒ for reliable parsing in downstream apps.
Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
Huggingface / GitHub / Try it out: https://huggingface.co/nanonets/Nanonets-OCR-s
Try it with Docext in Colab: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
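For anyone who wants to try it outside the Colab, here's a minimal sketch of running the model locally with Hugging Face transformers. It assumes a recent transformers release with the standard Qwen2.5-VL classes (the base model is Qwen2.5-VL-3B), and the prompt text is just illustrative, not the official prompt from the model card:

```python
# Minimal sketch: run Nanonets-OCR-s on a single page image with Hugging Face
# transformers. Assumes a recent transformers with Qwen2.5-VL support; the
# prompt text below is illustrative, not the model's official prompt.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "nanonets/Nanonets-OCR-s"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("page_1.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this document page to structured Markdown."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Strip the prompt tokens so only the generated Markdown remains.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
markdown = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(markdown)
```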
Correct link for Docext: https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
Could it be used (maybe with the help of a downstream LLM) to parse a photo/PDF of a restaurant menu into a JSON file conforming to a schema? Or would bigger, hosted multimodal LLMs work better in such a case?
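One way that two-step pipeline could look, as a rough sketch: OCR the menu to Markdown first, then have a downstream text LLM emit JSON and validate it against a schema. The `call_llm` hook and the schema fields below are placeholders for whatever LLM and menu structure you actually use, not anything the model ships with:

```python
# Sketch of the two-step pipeline from the question: OCR the menu image to
# Markdown first, then have a downstream text LLM emit JSON that is validated
# against a schema. `call_llm` is a placeholder for any chat-completion call
# returning a string; the schema fields are illustrative only.
from pydantic import BaseModel, ValidationError

class MenuItem(BaseModel):
    name: str
    price: float
    description: str | None = None

class Menu(BaseModel):
    restaurant: str | None = None
    items: list[MenuItem]

def parse_menu(markdown: str, call_llm) -> Menu:
    prompt = (
        "Extract every menu item from the following Markdown and return ONLY JSON "
        'of the form {"restaurant": str | null, "items": [{"name": str, "price": float, '
        '"description": str | null}]}.\n\n' + markdown
    )
    raw = call_llm(prompt)
    try:
        return Menu.model_validate_json(raw)
    except ValidationError:
        # Retry / repair logic would go here; re-raise for the sketch.
        raise
```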
So it feels like this finally lets me do one thing I'd wanted for some time: scan printed documents and generate structured PDFs (rather than a PDF that's just a container for page images).
Would any of this be able to handle magazine layouts? I've yet to find anything that can follow their fairly random layouts, with text at varying angles, etc.
Have you found that accuracy improves or scales with larger models? Or are the improvements, if any, marginal compared to the 3B VLM?
Does it hallucinate, given that an LLM is being used?
Sometimes. I just fed the huggingface demo an image containing some rather improbable details [1] and it OCRed "Page 1000000000000" with one extra trailing zero.
Honestly, I was expecting the opposite: a repetition penalty kicking in after repeating zero too many times, resulting in too few zeros. But apparently not. So you might want to steer clear of this model if your document has a trillion pages.
Other than that, it did a solid job - I've certainly seen worse attempts to OCR a table.
[1] https://imgur.com/a/8rJeHf8
The base model is Qwen2.5-VL-3B and the announcement says a limitation is "Model can suffer from hallucination"
Seems a bit scary that the "source" text from the pdfs could actually be hallucinated.
Given that the input is an image and not raw PDF text, it's not completely unexpected.
Does it have a way to extract the images themselves, or is that still a separate process later?
If you're after extracting images from PDFs, there are plenty of tools that do that just fine without LLMs.
I mean, ideally it would be in context, so the generated markdown references the correct image at the correct location in the doc. Unless that's what you're talking about? In which case I don't know about those tools.
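For born-digital PDFs (not scans), something like PyMuPDF can pull out both the embedded images and their positions on the page, which you could use to splice image references back into the generated Markdown yourself. A rough sketch, assuming the standard PyMuPDF API:

```python
# Rough sketch: extract embedded images plus their page positions from a
# born-digital PDF with PyMuPDF, so they could be re-inserted at roughly the
# right spot in generated Markdown. Won't help with scanned pages, where the
# whole page is one big image.
import pymupdf  # pip install pymupdf (older versions import as `fitz`)

doc = pymupdf.open("input.pdf")
for page_index, page in enumerate(doc):
    for img_index, img in enumerate(page.get_images(full=True)):
        xref = img[0]
        info = doc.extract_image(xref)       # raw bytes + file extension
        rects = page.get_image_rects(xref)   # where the image appears on the page
        filename = f"page{page_index + 1}_img{img_index + 1}.{info['ext']}"
        with open(filename, "wb") as f:
            f.write(info["image"])
        print(filename, [tuple(round(v, 1) for v in r) for r in rects])
```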