Hacker News

The problems don't actually exist in the way you think.

When extracting text directly, the goal is to put it back into content order, regardless of stream order. Then turn that into a string. As fast as possible.
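To make "content order, regardless of stream order" concrete, here's a rough sketch. The `Fragment` type, `to_content_order`, and the `line_tolerance` value are all made up for illustration; real extractors have to deal with columns, rotation, tables, and so on, which is exactly where the hard part is.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """Hypothetical text run: a string plus where it is placed on the page."""
    text: str
    x: float
    y: float  # PDF-style coordinates: larger y is higher on the page


def to_content_order(fragments, line_tolerance=2.0):
    """Rebuild reading order from position alone, ignoring stream order."""
    # Top of the page first.
    ordered = sorted(fragments, key=lambda f: -f.y)

    # Group fragments whose baselines are close enough into one line.
    lines = []
    for frag in ordered:
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, read left to right.
    return "\n".join(
        " ".join(f.text for f in sorted(line, key=lambda f: f.x))
        for line in lines
    )


# Stream order is scrambled on purpose.
stream = [
    Fragment("world", 60, 700),
    Fragment("Second", 10, 680),
    Fragment("Hello", 10, 700),
    Fragment("line", 60, 680),
]
print(to_content_order(stream))
```

This only handles a single column of horizontal text. Everything it ignores is where the parsing errors come from.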

That's the plain-text case. If you want layout info, the extractor does more. Either way, it's not just processing the file as a straight stream and rasterizing the result. It's trying to avoid doing that work.

This is non-trivial on lots of PDFs, and it's a source of lots of parsing issues and errors, precisely because the extractor has to reconstruct the content order without rendering the page.

When rasterizing, you don't care about any of this at all. PDFs were made to raster easily. It does not matter what order the text is in the file, or where the tables are, because if you parse it straight through, raster, and splat it to the screen, it will be in the proper display order and look right.

So if you splat it onto the screen, and then extract it, it will be in the proper content/display order for you. Same is true of the tables, etc.
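A toy version of that, with a character grid standing in for the screen. Everything here is an assumption for illustration: `splat` is a pretend rasterizer, and `read_screen` stands in for whatever reads the rendered page (OCR, a vision model, anything), assumed to be perfectly accurate.

```python
def splat(ops, width=20, height=3):
    """Draw each (text, row, col) op onto a grid, in whatever order they arrive."""
    grid = [[" "] * width for _ in range(height)]
    for text, row, col in ops:
        for i, ch in enumerate(text):
            if 0 <= row < height and 0 <= col + i < width:
                grid[row][col + i] = ch
    return grid


def read_screen(grid):
    """Read the grid top to bottom, left to right, skipping blank rows."""
    lines = ("".join(row).rstrip() for row in grid)
    return "\n".join(line for line in lines if line)


# Draw order is scrambled; the positions decide what ends up where.
ops = [("world", 0, 6), ("Second line", 1, 0), ("Hello", 0, 0)]
print(read_screen(splat(ops)))
```

No sorting or line-grouping step is needed: the draw order stops mattering once everything has landed at its position, and reading the grid gives display order for free.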

So the direct extraction problems don't exist if you can parse the screen into whatever you want, with 100% accuracy (and of course it doesn't matter if you use AI or not to do it).

Now, I am not sure I would use this method anyway, but your claim that the same problems exist is definitely wrong.