Hacker News

protomikron 3 days ago [ - ]

I think there is some kind of misunderstanding. Sure, if you get somehow structured, machine-generated PDFs parsing them might be feasible.

But what about the "scanned" document part? How do you handle that? Your PDF rendering engine probably just says: image at pos x,y with size height,width.

So as parent says you have to OCR/AI that photo anyway and it seems that's also a feasible approach for "real" pdfs.

throwaway4496 3 days ago [ - ]

Okay, this sounds like "because some part of the road is rough, why don't we just drive in the ditch along the road way all the way, we could drive a tank, that would solve it"?

Macha 3 days ago [ - ]

My experience is that “text is actually images or paths” is closer to the 40% case than the 1% case.

So you could build an approach that works for the 60% case, is more complex to build, and produces inferior results, but then you still need to also build the ocr pipeline for the other 40%. And if you’re building the ocr pipeline anyway and it produces better results, why would you not use it 100%?