Vision tokens would only be a viable alternative to text if/when the LLM had learnt to read and could control the page scanning itself: segmenting the page into text and non-text regions, segmenting the text regions into lines, scanning the lines in the language-specific direction (left to right, or right to left), segmenting the lines into words, and so on - basically everything an OCR program needs to do before the actual character recognition.
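For concreteness, here is a minimal sketch of those pre-recognition steps, using an assumed toy data model (a page as a list of regions, strings standing in for word-image crops) rather than any real OCR library:

```python
# Toy sketch of the region -> line -> word scan order an OCR pipeline imposes,
# which an LLM reading raw page images would implicitly have to learn.

def scan_page(page, reading_direction="ltr"):
    words_in_order = []
    for region in page:
        if region["kind"] != "text":        # skip non-text regions (figures, decorations)
            continue
        for line in region["lines"]:        # each line is a list of word crops
            line_words = list(line)
            if reading_direction == "rtl":  # honour right-to-left scripts
                line_words.reverse()
            words_in_order.extend(line_words)
    return words_in_order                   # word crops in reading order, ready for recognition

page = [
    {"kind": "figure", "lines": []},
    {"kind": "text", "lines": [["the", "cat", "sat"], ["on", "the", "mat"]]},
]
print(scan_page(page))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```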
Even having learnt all of this - or with a pre-processor that extracts the word sequence from the page image - the LLM would still need to learn to generalize over different font faces and sizes, as well as imperfect, speckled and/or distorted scans.
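As a rough, assumed illustration of that variation (not anything from this thread), this is the kind of augmentation a training pipeline might apply - the same toy glyph at a different scale with speckle noise, using only numpy:

```python
# Assumed augmentation sketch: rescale a glyph bitmap and flip random pixels
# to mimic speckled, imperfect scans the model would need to be robust to.

import numpy as np

rng = np.random.default_rng(0)

def augment(glyph, scale=2, speckle_prob=0.05):
    big = np.kron(glyph, np.ones((scale, scale), dtype=glyph.dtype))  # naive upscaling
    flips = rng.random(big.shape) < speckle_prob                      # random speckle
    return np.where(flips, 1 - big, big)

glyph = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [1, 0, 1]])        # toy 3x3 "letter"
print(augment(glyph, scale=3))
```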
Finally, and surely not least: if the goal is to reduce (inference?) computational load by representing multiple words as a single image token, then more training epochs may be needed, with variation in how words are grouped, since the same sequence of words would not always fall into the same group - the LLM would have to learn that an image token representing "the cat sat" may, elsewhere, have been split across tokens such as "today the cat" and "sat on the".
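To make the grouping problem concrete, here is a toy illustration (plain Python, not tied to any particular model or tokenizer): depending on where a fixed-size chunk boundary falls, "the cat sat" either lands in one image token or is spread across two.

```python
# Same word sequence, different chunk boundaries -> different "image tokens".

def group_words(words, group_size=3, offset=0):
    shifted = words[offset:]
    return [" ".join(shifted[i:i + group_size]) for i in range(0, len(shifted), group_size)]

text = "today the cat sat on the mat".split()
print(group_words(text))            # ['today the cat', 'sat on the', 'mat']
print(group_words(text, offset=1))  # ['the cat sat', 'on the mat']
```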
A better way to reduce the number of tokens being processed might be to have the LLM learn to combine multiple adjacent tokens into one, perhaps starting from individual letters at the input, although this would of course require a fairly major change to the Transformer architecture.
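As a hedged sketch of what that might look like (an assumed toy module, not an existing architecture): embed individual characters, then let a learned, strided merge layer pool every k adjacent characters into one wider token before the Transformer blocks ever see them.

```python
# Sketch of learned merging of adjacent character tokens; dimensions are illustrative.

import torch
import torch.nn as nn

class CharMerger(nn.Module):
    def __init__(self, vocab_size=256, char_dim=64, model_dim=512, merge_k=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)
        # learned merge: each output position summarizes merge_k adjacent characters
        self.merge = nn.Conv1d(char_dim, model_dim, kernel_size=merge_k, stride=merge_k)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        x = self.embed(char_ids)                 # (batch, seq_len, char_dim)
        x = x.transpose(1, 2)                    # (batch, char_dim, seq_len)
        x = self.merge(x)                        # (batch, model_dim, seq_len // merge_k)
        return x.transpose(1, 2)                 # shorter sequence of merged tokens

chars = torch.randint(0, 256, (1, 32))           # 32 "letters" in
print(CharMerger()(chars).shape)                 # torch.Size([1, 8, 512]) -- 4x fewer tokens out
```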
Multimodal LLMs already learn to generalize over text inside images. In my experience most multimodal LLMs are significantly better than traditional OCR, especially if there's any unusual formatting going on.
This thread is considering image input as an alternative to text input for text itself, not as an alternative to other OCR approaches, so the accuracy bar is 100%.
I've had mixed results with LLMs for OCR: sometimes excellent (zero errors on a photo of my credit card bill), but poor when the source wasn't a printed page - sometimes "reusing" the same image section for multiple extracted words!
FWIW, I highly doubt that LLMs have just learnt to scan pages from (page image, page text) training pairs - more likely, text-heavy image input triggers special OCR handling.