My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the ocr pipeline for the other 40%. And if you’re building the ocr pipeline anyway and it produces better results, why would you not use it 100%?