My attempts at using AI for OCR have always produced invented artifacts, which isn't feasible for production. Does this suffer from that as well?
A simple example is words that are supposed to be in other languages being automatically translated into English, which ruins the effect.
You almost don't want [super-]word level ML (i.e. word-pair/phrase/sentence/document/corpus level).
In transcription, you want near certainty, or you want a marker that the word could not be read with certainty. Yes, context lets you guess, but for some OCR you want to know when a reading is a guess based on something other than the letters, in order, forming a word.
For example, in a census document on familysearch.com, the transcriber "corrected" a name to Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough, that's a local variant spelling (Eire).
In another document the writer used "Joh" as an abbreviation, and a [human, I assume] transcriber rendered that as John ... which is the most likely reading, but happens to be wrong.
Sometimes you care that it's guessed, sometimes you want just the best guess.
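To make that concrete, here is a rough sketch of what I mean by keeping the literal reading and the contextual guess side by side. Everything in it is made up for illustration: the `Token` class, the `render` and `guessed` helpers, the 0.9 threshold and the confidence numbers are not from any real OCR tool, and the third sample word is invented to show a low-confidence reading.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    literal: str       # the letters as they appear on the page
    best_guess: str    # the context-informed reading (may equal literal)
    confidence: float  # hypothetical 0..1 score for the literal reading

    @property
    def is_guess(self) -> bool:
        # True when the suggested reading departs from the letters on the page.
        return self.best_guess != self.literal


def render(tokens: List[Token], mode: str = "literal", min_conf: float = 0.9) -> str:
    """mode="literal" keeps the letters and brackets unsure words;
    mode="best" returns only the best guesses."""
    if mode not in ("literal", "best"):
        raise ValueError(f"unknown mode: {mode!r}")
    words = []
    for t in tokens:
        if mode == "best":
            words.append(t.best_guess)
        elif t.confidence < min_conf:
            words.append(f"[{t.literal}?]")  # could not be read with certainty
        else:
            words.append(t.literal)
    return " ".join(words)


def guessed(tokens: List[Token]) -> List[Tuple[str, str]]:
    """List (literal, best_guess) pairs where context overrode the letters."""
    return [(t.literal, t.best_guess) for t in tokens if t.is_guess]


# Illustrative values only; the confidences are invented.
sample = [
    Token("Josepth", "Joseph", 0.97),  # clearly written, but a variant spelling
    Token("Joh", "John", 0.95),        # abbreviation; the likely expansion may be wrong
    Token("Murphy", "Murphy", 0.55),   # invented example of a hard-to-read word
]

print(render(sample, "literal"))
print(render(sample, "best"))
print(guessed(sample))
```

The point of the split is that a consumer who only wants readable text can ask for `mode="best"`, while someone doing genealogy can see exactly where the transcription departed from the letters.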
> Eire
A nitpick, because it's often a dogwhistle: almost nobody in Ireland calls it that when speaking English. And it's still incorrect in Irish; the correct spelling is Éire.
By saying it's a dogwhistle, are you saying that omitting the correct diacritics is considered racist by Irish people? If I change the rest of the sentence to Na Gaeilge, will that be better?
I've been trying out this model on a 4090 to transcribe a Japanese grammar PDF (written in English with lots of Japanese examples), and it seems to be working very well, judging from the small parts I have double-checked. The output contains both the kanji/hiragana and the English as appropriate, without attempting any translation.
It has converted about 200 pages in an hour.
If I wanted to achieve 100% recognition, I would combine this method with an image model that recreates the original document from the transcribed text while matching the layout. You can do that by using everything in the document except the page or paragraph you want to recreate (to avoid recreating the exact passage under test directly from the image). After reconstructing, you can do an optical comparison that specifically accounts for misaligned characters, and find the errors. Rinse and repeat. Expensive, but it would guarantee 100% recognition.
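A minimal sketch of just the comparison step, under heavy assumptions: the original page and the reconstruction have already been cut into same-sized per-character cells, each cell is a tiny binary bitmap, and "misaligned" means off by a pixel or so. The image model itself is not shown, and the function names, the 0.1 threshold and the shift range are all hypothetical.

```python
from typing import List

Bitmap = List[List[int]]  # rows of 0/1 pixels; compared cells must share one size


def mismatch(a: Bitmap, b: Bitmap, dx: int, dy: int) -> float:
    """Fraction of differing pixels when b is shifted by (dx, dy).
    Pixels shifted in from outside the cell count as background (0)."""
    h, w = len(a), len(a[0])
    diff = 0
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            pb = b[sy][sx] if 0 <= sy < h and 0 <= sx < w else 0
            if a[y][x] != pb:
                diff += 1
    return diff / (h * w)


def best_mismatch(a: Bitmap, b: Bitmap, max_shift: int = 1) -> float:
    """Smallest mismatch over all shifts within +/- max_shift pixels,
    so a slightly misplaced glyph is not reported as an error."""
    return min(
        mismatch(a, b, dx, dy)
        for dx in range(-max_shift, max_shift + 1)
        for dy in range(-max_shift, max_shift + 1)
    )


def suspect_cells(original: List[Bitmap], rebuilt: List[Bitmap],
                  threshold: float = 0.1, max_shift: int = 1) -> List[int]:
    """Indices of character cells that still differ after allowing for shifts."""
    return [
        i for i, (o, r) in enumerate(zip(original, rebuilt))
        if best_mismatch(o, r, max_shift) > threshold
    ]


# Toy 4x4 glyphs: a vertical bar, the same bar nudged right, and a horizontal bar.
bar = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
bar_shifted = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
dash = [[0, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

print(suspect_cells([bar, bar], [bar_shifted, dash]))
```

In the toy data the nudged bar passes and the dash is flagged, which is the behaviour you'd want: re-transcribe only the flagged cells, rebuild, and compare again.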
I'm curious about this. What models/tools have you been using?