Hacker News

Not all machine learning is generative AI.

True, but as with regular document-scanning software, there can be errors in detection.

Just as with redacted documents (consistently blocked terms) or bad OCR jobs (wrong or missing characters), even if only a certain percentage comes out unmangled, that is still more readable than having no data at all.

A stable base corpus and some dynamic programming will allow you to clean up the remainder[0].

[0]: http://stackoverflow.com/a/11642687/2449774
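As a rough illustration of what "a base corpus plus dynamic programming" can look like, here is a minimal sketch that re-inserts spaces lost in a bad OCR pass. It is not taken from the linked answer. The tiny `WORD_COUNTS` table, the unknown-word penalty, and the function names are all made up for the example, and a real cleanup would use a large, stable word-frequency list.

```python
import math

# Hypothetical toy corpus (word -> count); stands in for a real frequency list.
WORD_COUNTS = {
    "the": 50, "quick": 5, "brown": 5, "fox": 4, "a": 30, "an": 10,
    "scan": 3, "scanned": 3, "document": 6, "was": 20, "read": 6,
    "able": 4, "readable": 2,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def word_cost(word):
    """Negative log probability of a known word; heavy penalty otherwise."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Assumed penalty: far worse per character than any known word.
        return 20.0 * len(word)
    return -math.log(count / TOTAL)


def segment(text):
    """Split text into the lowest-cost word sequence using dynamic programming."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i] = cheapest cost to split text[:i]
    back = [0] * (n + 1)           # back[i] = start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = best[start] + word_cost(text[start:end])
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(" ".join(segment("thescanneddocumentwasreadable")))
```

Note that this only picks the most plausible split given the corpus. It cannot tell you whether the underlying characters were read correctly in the first place.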

mkl a day ago

The problem is when you can't tell which bits are unmangled. OCR systems will happily give you plausible but wrong readings, and even some scanners/copiers will change things: https://dkriesel.com/en/blog/2013/0802_xerox-workcentres_are...

selcuka 18 hours ago

Yeah. There was a weird bug in Xerox WorkCentre scanner/copiers that swapped digits (for example, turning 6s into 8s) in scanned documents. It was caused by the JBIG2 image compression they used [1].

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
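For anyone wondering how a compression format can swap digits: lossy JBIG2 encoders can store one bitmap per "similar enough" symbol and reuse it wherever a match is found. Below is a toy sketch of that pattern-matching idea only, not the actual Xerox or JBIG2 implementation. The 5x3 glyphs, the `max_diff` threshold, and the function names are invented for the example.

```python
# Toy 5x3 glyph bitmaps (hypothetical, not real scanner data).
GLYPH_6 = (
    "###",
    "#..",
    "###",
    "#.#",
    "###",
)
GLYPH_8 = (
    "###",
    "#.#",
    "###",
    "#.#",
    "###",
)


def hamming(a, b):
    """Count the pixels that differ between two equally sized glyphs."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Replace each glyph with the index of the first stored glyph within max_diff pixels."""
    dictionary = []
    indices = []
    for glyph in glyphs:
        for i, known in enumerate(dictionary):
            if hamming(glyph, known) <= max_diff:
                indices.append(i)  # reuse an existing symbol, possibly the wrong one
                break
        else:
            dictionary.append(glyph)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the page from the symbol dictionary."""
    return [dictionary[i] for i in indices]


page = [GLYPH_8, GLYPH_6]
lossy = decode(*encode(page, max_diff=1))
print(lossy[1] == GLYPH_8)  # the 6 was silently replaced by the stored 8
```

The output is a clean, sharp glyph either way, which is why this kind of error is so hard to spot: nothing looks mangled.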