Kinda funny.

Printing a PDF and scanning it for an email would normally be worthy of major ridicule.

But you’re basically doing that to parse it.

I get it, and I've heard of others doing the same. It just seems damn frustrating that this is necessary. The world sure doesn't parse HTML that way!

I've built document parsing pipelines for a few clients recently, and yeah this approach yields way superior results using what's currently available. Which is completely absurd, but here we are.

I've done only one pipeline trying to parse actual PDF structure, and the least surprising part of it is that some documents have a top-to-bottom layout and others have bottom-to-top, flipped, with the text flipped again to be readable. It only gets worse from there. Absurd is correct.
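To make the "flipped, then flipped again" case concrete, here is a minimal sketch in Python. It assumes the usual PDF convention of six-number affine matrices (a, b, c, d, e, f) applied to row vectors; the helper names `concat` and `apply_matrix` and the example numbers are made up for illustration, not taken from any particular library or document.

```python
# Hypothetical helpers: PDF-style affine matrices as (a, b, c, d, e, f),
# where a point maps to (a*x + c*y + e, b*x + d*y + f).

def concat(m, n):
    """Return the matrix that applies m first, then n."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply_matrix(m, x, y):
    """Map a point through the matrix."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

# Assumed example: the page content is flipped vertically on a 792pt-tall page...
page_flip = (1, 0, 0, -1, 0, 792)
# ...and the text is flipped again so it ends up readable.
text_flip = (1, 0, 0, -1, 100, 92)

net = concat(text_flip, page_flip)
print(net)                       # the two flips cancel into a plain translation
print(apply_matrix(net, 0, 0))   # where the glyph origin lands on the page
```

The point is that you can't trust any single matrix in the file: only the composed result tells you which way is up and where the text actually sits.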

That means you have to put the text (each individual letter) into its correct place by rendering the PDF, but that doesn't justify actual OCR, which goes one step further and then back again by rendering the glyphs and guessing them back. But that's just text; tables and structure are also in there somewhere to be recovered.
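A rough sketch of what "putting each letter into its correct place" without OCR can look like, in Python. It assumes you already have each glyph's character, position and width in page coordinates with y pointing up; the function name `glyphs_to_text` and the tolerance values are hypothetical and would need tuning on real documents.

```python
def glyphs_to_text(glyphs, line_tol=2.0, space_gap=3.0):
    """Rebuild reading order from (char, x, y, width) tuples.

    Assumes y grows upward, so larger y means higher on the page.
    line_tol and space_gap are made-up thresholds in page units.
    """
    lines = []  # each entry: [reference_y, [glyphs on that line]]
    # Walk glyphs from the top of the page to the bottom.
    for g in sorted(glyphs, key=lambda g: -g[2]):
        if lines and abs(lines[-1][0] - g[2]) <= line_tol:
            lines[-1][1].append(g)       # same baseline, same line
        else:
            lines.append([g[2], [g]])    # start a new line

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[1])     # left to right within the line
        text = ""
        prev_end = None
        for ch, x, _, w in row:
            # A large horizontal gap is treated as a word break.
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += ch
            prev_end = x + w
        out.append(text)
    return "\n".join(out)

# Glyphs in arbitrary file order, as they often appear in a content stream.
sample = [
    ("b", 6, 700, 5),
    ("H", 0, 720, 5),
    ("a", 0, 700, 5),
    ("i", 5, 720, 5),
    ("c", 20, 700.5, 5),
]
print(glyphs_to_text(sample))
```

This only recovers lines and word breaks. Columns, tables and rotated text need more than a sort by coordinates, which is where it gets painful.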

Jesus Christ. What other approaches did you try?

Maybe not literally that, but the eldritch horrors of parsing real-world HTML are not to be taken lightly!

If the HTML in question included JavaScript that renders everything, including text, into a canvas -- yes, this is how you would parse it. And PDF is basically that.

The analogy doesn't work tho. If you print a PNG and scan it for an email you will be ridiculed. But OCRing a PNG is perfectly valid.