Seems wildly counterintuitive to me, frankly.
Even if true, though, I'm not sure what we’d do with it. The bulk of knowledge available on the internet is text, aside from maybe YouTube. So I guess it could work for world-model-type things? Understanding physical interactions of objects, etc.
All text is technically converted to images before we see it.
Only if you see it instead of hearing it or touching it.
Trivial to convert text to images for processing. But counterintuitive to me too.