OCR was solved a long time ago with vision models. The solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?
I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?
I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.
OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.
It absolutely hasn't been solved, it's just got pretty decent in recent years.
Pretty decent might be quite the stretch. I'd term it almost acceptable, but only if you're using commercial solutions like Amazon's Textract. Doing it with open source tools is, at best, extremely painful and vaguely accurate.
PaddleOCR (also from Baidu) is pretty damn good actually.
I have shipped PaddleOCR to prod. Works pretty well. (Usage limited to printed documents in the Anglosphere.) Runs fully offline, on CPU.
Is it? I've never seen a single OCR that would replace a human just typing it by hand.
What if the goal is something actually useful, such as converting a scientific paper PDF back to LaTeX that renders into a pixel-perfect copy? What about converting tables from electronics datasheets into computer-readable form? I wouldn't expect that even in the next decade.
I've had success with vision models & OCR, saved me many hours / days / weeks of typing work.
Last year I finally OCR'd many hundreds of pages of my father's old writings. I found that feeding it to Claude Sonnet 4.x via API gave me results that were perfect. No corrections required. So perfect, that Claude was reading along with the story, and actually pointed out a continuity error in the story where an incorrect character was reciting dialog. Claude asked if it should transcribe exactly as is or if I would like Claude to correct the continuity error.
Claude also correctly OCR'd some handwriting that was in the margins of the documents. Sonnet came very close to transcribing a Word Sleuth puzzle, but that was where I hit the limits of its capability at the time.
Mistral OCR was also good (and actually what I started with), but it wasn't quite as good as Claude. And when it was wrong, Mistral could be frighteningly wrong - one API call must have failed, the model must have been presented with a pure black / null image, and I got back a "transcription" that described neverending darkness. It read like something the Woodsman would have broadcast in Twin Peaks S3E8. That poor model.
Tables from electronics datasheets might be okay. I think I've had success with OCR of technical manuals with tables for 80s synthesizer hardware. But I admit my use cases don't cross over into transcriptions of equations or graphs.
Detecting characters: almost. Layout: no.
Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision LLMs can improve a bit on character recognition, but at the cost of harder-to-detect failure modes.
But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc., vision LLMs are a giant step forward. You do need the context of the previous page to make good decisions about the current page, though, which is where things quickly get janky (or slow, if you choose the naive approach).
Vision LLMs also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago: it all "just works" without you even having to remember that this can happen.
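On the previous-page context point, here is a rough Python sketch of one way to keep it cheap. `transcribe_page` is a hypothetical stand-in for whatever vision-LLM call you actually use (it is not a real API), and `context_chars` is an arbitrary knob I made up. The idea is to pass only the tail of the previous page's transcription, so the per-page cost stays bounded instead of growing with the length of the document:

```python
from typing import Callable, List


def transcribe_document(
    pages: List[bytes],
    transcribe_page: Callable[[bytes, str], str],  # hypothetical model call
    context_chars: int = 500,                      # assumed, tune to taste
) -> List[str]:
    """Transcribe pages in order, giving each page a bounded tail of the
    previous page's text as context."""
    results: List[str] = []
    previous_tail = ""  # the first page has no context
    for image in pages:
        text = transcribe_page(image, previous_tail)
        results.append(text)
        # Keep only the last `context_chars` characters for the next page.
        previous_tail = text[-context_chars:] if context_chars > 0 else ""
    return results


# Fake model for demonstration only: reports how much context it received.
def fake_transcribe(image: bytes, context: str) -> str:
    return f"{image.decode()} (context: {len(context)} chars)"


out = transcribe_document([b"page1", b"page2"], fake_transcribe, context_chars=10)
```

The trade-off is that pages have to be processed in order, so you give up parallelism across pages in exchange for the context.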
Real question: what tool do you use? (for long/complex documents with tables, code, maths)
- marker (with --force-ocr) gives me the best results
- Mistral OCR (seems really great, but I never managed to get it to work)
- Mathpix (tried a long time ago)
- docling (gives me garbage, I must be using it wrong)
- Unlimited OCR (will try it)
- ???
- Azure Document Intelligence (has an option to return markdown too including headers and footers).
- AWS Textract
Exactly. They're both very expensive and prone to surprising you. Sometimes in a good way, sometimes in a bad way. I'd rate them 85%, but you have to run a test because they both fail in different ways on the 15%.
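If you do run that test, one common way to score it is character error rate (CER) against a hand-typed reference. A minimal Python sketch follows. The sample strings and the `engine_a` / `engine_b` names are made up for illustration and are not measurements of any real service:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: one ground-truth line and two hypothetical engine outputs.
reference = "Invoice 10432"
outputs = {"engine_a": "Invoice 10432", "engine_b": "Invo1ce 1O432"}
scores = {name: cer(reference, text) for name, text in outputs.items()}
```

Scoring per document rather than over the whole corpus makes it easier to see where each engine falls over, since an average hides the bad 15%.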
poma-ai has really great chunking techniques that chunk the document based on the document structure/hierarchy.
We use it on 200-page IEEE standards that are notoriously complex, filled with tables and diagrams. Highly recommend.
I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):
"A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
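To put rough numbers on the KV cache point: for standard attention, the cache holds one key and one value vector per layer, per KV head, per token, so memory grows linearly with sequence length. A back-of-the-envelope Python sketch, where the decoder configuration is made up for illustration and is not the model from the paper:

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a standard attention KV cache for one sequence.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens, **config) / 2**20
    print(f"{tokens:,} tokens -> {mib:,.0f} MiB")
# 1,000 tokens -> 125 MiB
# 10,000 tokens -> 1,250 MiB
# 100,000 tokens -> 12,500 MiB
```

The slowdown comes from the same place: each new token attends over everything already in the cache, so per-token attention work also grows with the length of the output so far.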
Aside: what is the best tool for reading receipts/bank statements/invoices?
I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.
This is already used in OCR; Tesseract does that.
lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go
> I would definitely understand post-processing, like extracting data, answering questions, etc., but why redo the OCR engine itself?
Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, as the OCR engine is now context-aware.
Say, for example, a chemical with some random long-ass name. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.
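A context-aware model does this implicitly, but the "supplemented with domain data" half of the idea can be approximated with a plain fuzzy lookup against a domain vocabulary. This Python sketch uses the standard library's `difflib` as a stand-in for illustration (it is not how an LLM decoder works), and the vocabulary and the `cutoff` value are assumptions:

```python
import difflib
from typing import Sequence


def correct_term(ocr_text: str, vocabulary: Sequence[str],
                 cutoff: float = 0.85) -> str:
    """Return the closest vocabulary entry if it is similar enough,
    otherwise return the OCR text unchanged."""
    matches = difflib.get_close_matches(ocr_text, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text


# Tiny made-up vocabulary; a real one would come from a chemistry database.
chemicals = ["acetylsalicylic acid", "acetaminophen", "sodium bicarbonate"]

fixed = correct_term("acety1salicylic ac1d", chemicals)  # 'l' and 'i' read as '1'
unchanged = correct_term("potassium nitrate", chemicals)  # not in vocabulary
```

The cutoff matters: set it too low and the corrector will happily replace a correct but unfamiliar name with a wrong but familiar one, and the output will still look plausible.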
Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...
[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
Cost, throughput, latency...
Traditional OCR is faster, cheaper, and much more reliable than LLMs
If you consider non-English scripts, traditional OCR is not more reliable.
CJK scripts have lots of characters and a high confusion rate.
Arabic scripts are complex and have lots of morphs.
Vietnamese has easily confused diacritics.
Thai has lots of non-standard fonts.
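To make the Vietnamese point concrete, here is a small Python illustration using only the standard library. Stripping the combining marks (roughly what a misread diacritic does to the text) collapses six different Vietnamese words into the same string:

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD and drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Six distinct Vietnamese words that differ only in their tone marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_marks(word) for word in words}
```

Here `collapsed` ends up containing only "ma", so a single misread mark can turn one valid word into another valid word, which a spell checker will not catch.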
I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.
I wish it were. Alas...
OCR has definitely not "been solved a long time ago", what are you talking about?
In your opinion, what is SOTA here?