Hacker News

The PP-DocLayoutV3 [1] bounding boxes are pretty good in my experience, if you want boxes around individual document headings or paragraphs. If you want boxes around individual words, similar to what's shown in the Interfaze screenshot [2], Apple has a LiveText "token" model that's proprietary but free/bundled with macOS and iOS. There are easy-to-use Python bindings here: https://github.com/straussmaximilian/ocrmac
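For anyone who wants to draw those word boxes on the image, here is a minimal sketch of the coordinate conversion. It assumes each annotation comes back as (text, confidence, [x, y, w, h]) with the box normalized to 0..1 and the origin at the bottom-left, which is my understanding of what `ocrmac.OCR(path).recognize()` returns; check the ocrmac README before relying on it. `PixelBox` and `to_pixel_box` are hypothetical names for this example, not part of ocrmac, and the sample annotation is made up.

```python
from typing import NamedTuple, Sequence


class PixelBox(NamedTuple):
    """Pixel-space box with the origin at the top-left of the image."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixel_box(bbox: Sequence[float], image_width: int, image_height: int) -> PixelBox:
    """Convert a normalized [x, y, w, h] box (origin bottom-left, assumed)
    into pixel corners with the origin at the top-left."""
    x, y, w, h = bbox
    left = x * image_width
    right = (x + w) * image_width
    # Flip the y axis: in bottom-left coordinates the top edge sits at y + h.
    top = (1.0 - (y + h)) * image_height
    bottom = (1.0 - y) * image_height
    return PixelBox(round(left), round(top), round(right), round(bottom))


# Made-up annotation in the assumed (text, confidence, bbox) shape.
sample = ("Invoice", 1.0, [0.25, 0.5, 0.5, 0.25])
text, confidence, bbox = sample
print(text, to_pixel_box(bbox, image_width=800, image_height=400))
```

The resulting corners can be passed to whatever drawing library you use; the y flip is the part that is easy to get wrong.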

I presume that some otherwise-great OCR models (like Chandra) have terrible bounding boxes because generating good bounding boxes just wasn't a training priority. A lot of people are using OCR models to bulk-process documents without a lot of care for how the layout is preserved. It matters a lot if (e.g.) you want to be able to update and re-print old documents, but it doesn't matter if you are just transcribing whole documents for indexing/chunking/translation.

[1] https://huggingface.co/PaddlePaddle/PP-DocLayoutV3

[2] https://r2public.jigsawstack.com/interfaze/examples/dense_te...

yoeven 14 hours ago

For sure, there are tons of OCR bounding-box models and tons of other models, like SAM 3, for segmentation.

Interfaze is a more powerful version of them combined into a single model. You can run multi-turn tasks, like extracting all the text and objects from a document and then translating them or generating a report.

It's like getting the best of both worlds, from pure DNN/CNN models like Paddle and from the flexibility and nuance of an LLM, while outperforming both in accuracy.