Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch along the road way all the way, we could drive a tank, that would solve it"?

My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the ocr pipeline for the other 40%. And if you’re building the ocr pipeline anyway and it produces better results, why would you not use it 100%?