I'm using this approach quite often. I don't know of any documents created by humans for humans that have no formatting. The formatting, position etc. are usually an important part of the document.
Since the first multimodal llms came out, I'm using this approach when I deal with documents. It makes the code much simpler because everything is an image and it's surprisingly robust.
Works also for embeddings (cohere embed v4)