Multimodal LLMs already learn to generalize over text inside images. In my experience, most multimodal LLMs are significantly better than traditional OCR, especially when there's any unusual formatting involved.
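
To make the comparison concrete, here's roughly what the two paths look like in code. This is a minimal sketch, assuming the OpenAI Python client and pytesseract are installed; the model name and prompt are my own placeholders, not something from this thread:

    import base64
    import pytesseract
    from PIL import Image
    from openai import OpenAI

    def ocr_traditional(path: str) -> str:
        # Classic OCR: solid on clean printed text, degrades with
        # unusual layouts, fonts, or photos of non-page sources.
        return pytesseract.image_to_string(Image.open(path))

    def ocr_multimodal(path: str, model: str = "gpt-4o") -> str:
        # Multimodal-LLM "OCR": send the image and ask for a transcription.
        # The model name is an assumption; any vision-capable model
        # works the same way through this API shape.
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe all text in this image exactly."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content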

This thread is considering image input as an alternative way to feed text to the model, not as an alternative to other types of OCR, so the accuracy bar is 100%: the model has to reproduce the text exactly.
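
Given that bar, the natural check is character error rate against the known source text, where anything above 0.0 is a failure. A quick dependency-free sketch:

    def levenshtein(a: str, b: str) -> int:
        # Standard DP edit distance (insertions, deletions, substitutions).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def char_error_rate(hypothesis: str, reference: str) -> float:
        # CER = edit distance / reference length; 0.0 means the 100% bar is met.
        return levenshtein(hypothesis, reference) / max(len(reference), 1)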

I've had mixed results with LLMs for OCR: sometimes excellent (zero errors on a photo of my credit card bill), but poor when the source wasn't a printed page, sometimes "reusing" the same image section for multiple extracted words!

FWIW, I highly doubt that LLMs have simply learnt to scan pages from (page image, page text) training pairs; more likely, text-heavy image input triggers special OCR handling.