This is my first encounter with Scribe.js; since I have many book scans, I try OCRing a few whenever something like this shows up. Compared to Tesseract (the best I have so far), it gets slightly more words right, but its paragraph segmentation is many times worse. On a book where every paragraph is indented, it reliably decides that two consecutive one-line paragraphs are the same paragraph, which is understandable, but a downgrade from Tesseract, which gets the paragraph segmentation as correct as possible. (It doesn't handle paragraphs that span page breaks, since I'm feeding it one page at a time.)
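
For concreteness, the page-at-a-time workflow I mean looks roughly like this (a minimal sketch using pytesseract; the directory layout and --psm setting are illustrative assumptions, not anything Scribe.js does):

```python
# Minimal sketch: OCR a book one page image at a time with Tesseract.
# The file layout and the page-segmentation mode (--psm) are assumptions.
from pathlib import Path

import pytesseract
from PIL import Image

for page in sorted(Path("book_scans").glob("page_*.png")):
    # --psm 3 is Tesseract's default full automatic page segmentation.
    # Because each page is processed independently, a paragraph that
    # continues across a page break is necessarily split in two.
    text = pytesseract.image_to_string(Image.open(page), config="--psm 3")
    page.with_suffix(".txt").write_text(text)
```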

Scribe is Tesseract. It uses tesseract.js, which is a WebAssembly port of Tesseract, so in theory they should be equal. In practice, custom settings or older versions could make a difference.

This is only true in the "speed" mode; in the "quality" mode it claims better word recognition than Tesseract on clean scans (which matches my tests): https://github.com/scribeocr/scribe.js/blob/master/docs/scri...

What's the motivation for doing this in the browser? It seems like intentionally choosing a more difficult path to create an inferior result.

A native macOS or Windows application could use the OCR facilities of the operating system, and in my experience both produce results that are far better than Tesseract's.

It lets you generate the OCR on the fly, in the browser, when you don't have proper OCR data to begin with. As someone who works on public web libraries, I can see it being useful (if wasteful).

> Tesseract (which is the best I have so far)

Have you looked at EasyOCR?

EasyOCR is significantly worse than Tesseract for clean printed text, while being orders of magnitude slower; it is far better than Tesseract for low-quality scans and for extracting text from pictures (e.g. comics), which Tesseract does not handle well.
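
For anyone who wants to reproduce that comparison, here's a minimal side-by-side sketch (the image path and language are placeholders; both engines run with their defaults):

```python
# Run Tesseract (via pytesseract) and EasyOCR on the same scan and compare.
# The image path is a placeholder; EasyOCR downloads its models on first run
# and is much slower than Tesseract on CPU.
import easyocr
import pytesseract
from PIL import Image

IMAGE = "scan.png"  # hypothetical sample page

tess_text = pytesseract.image_to_string(Image.open(IMAGE))

reader = easyocr.Reader(["en"])     # detection + recognition models
results = reader.readtext(IMAGE)    # list of (bbox, text, confidence) tuples
easy_text = "\n".join(text for _, text, _ in results)

print("--- Tesseract ---")
print(tess_text)
print("--- EasyOCR ---")
print(easy_text)
```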

Have you tried Abbyy FineReader? It's the best OCR package I've seen.

It doesn't seem to have a Linux version, and I don't have a Mac or Windows machine.
