Is it possible for such a small model to outperform Gemini 3, or is this a case of the benchmarks not reflecting reality? I would love to be hopeful, but so far an open-source model has never actually turned out better than a closed one, even when the benchmarks said it was.

Off the top of my head: for a lot of OCR tasks, it's kind of worse for the model to be smart. I don't want my OCR to make stuff up or answer questions; I want it to recognize what is actually on the page.

Sometimes what is on the page is ambiguous. Imagine a scan where the dot over the i is missing in a word like "this". What's on the page is "thls" but to transcribe it that way would be an error outside of forensic contexts.
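
To make that concrete, here is a toy sketch of my own (not how any of the models discussed here actually work): a crude dictionary-based correction step that reads "thls" as "this" by default, but can be told to keep the literal glyphs for forensic work.

    import difflib

    # Tiny vocabulary standing in for a real language prior.
    VOCAB = ["this", "that", "the", "is", "on", "page"]

    def transcribe_token(raw: str, forensic: bool = False) -> str:
        # In forensic mode, report exactly the glyphs on the page.
        if forensic or raw.lower() in VOCAB:
            return raw
        # Otherwise prefer a close dictionary word over the literal reading.
        match = difflib.get_close_matches(raw.lower(), VOCAB, n=1, cutoff=0.7)
        return match[0] if match else raw

    print(transcribe_token("thls"))                 # -> "this"
    print(transcribe_token("thls", forensic=True))  # -> "thls"

Real OCR models bake this prior into the decoding rather than bolting it on afterwards, but the trade-off is the same: being "smart" here means preferring the plausible word over the literal glyphs.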

I am reminded that it's basically impossible to read cursive writing in a language you don't know, even if it uses the same alphabet.

Interesting. Won't stuff like entity extraction suffer, especially in multilingual use cases? My worry is that a smaller model might not realize some text is actually a person's name because it is very unusual.

The model does not need to be that smart to work out that an unfamiliar word starting with a capital letter is the name of a place or a person. It does not need to know who or what it refers to; it just needs to transcribe it.
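
As a rough illustration of that division of labour (my own toy heuristic, not a real NER pipeline): once the text is transcribed, even a trivial downstream pass can flag capitalised, out-of-vocabulary tokens as candidate names, without the OCR model ever knowing who they refer to.

    # Assumed toy heuristic: flag capitalised tokens that are not common
    # words and do not start a sentence. Real entity extraction would use
    # a proper NER model on the transcript; the OCR step stays the same.
    COMMON_WORDS = {"the", "a", "an", "this", "is", "in", "of", "and", "by", "was"}

    def candidate_names(transcript: str) -> list[str]:
        words = transcript.split()
        names = []
        for i, word in enumerate(words):
            token = word.strip(".,;:!?\"'()")
            if not token:
                continue
            sentence_start = i == 0 or words[i - 1].endswith((".", "!", "?"))
            if token[0].isupper() and token.lower() not in COMMON_WORDS and not sentence_start:
                names.append(token)
        return names

    print(candidate_names("This letter was sent by Þórunn Jónsdóttir from Reykjavík."))
    # -> ['Þórunn', 'Jónsdóttir', 'Reykjavík']

The point is just that transcription and entity understanding are separable steps, so a small transcription model does not block entity extraction downstream.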

Also, there are generalist models with enough of a grasp of a dozen or so languages to fit comfortably in 7B parameters. The older Mistral had the best multilingual support at the time, and newer models around that size are probably good candidates too. So I am not surprised that a specialised multilingual model fits in 8B or so.

No. Gemini is clearly the leader across the board: https://www.ocrarena.ai/leaderboard