Hacker News

I have been trying to catch up with recent OCR developments too. My documents have enough special requirements that public benchmarks didn't tell me enough to choose a model. Instead I'm building a small document OCR project with visualization tools for comparing bounding boxes, extracted text, region classification, etc. GLM-OCR is my favorite so far [1]. Apple's VisionKit is very good at text recognition, and fast, but it doesn't do high-level layout detection and it only works on Apple hardware. If you can run it, it's another useful source of data for cross-validation.
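To make the cross-validation idea concrete, here is a rough sketch of comparing two engines' output on the same page. It assumes each engine's results have already been normalized into a hypothetical Region record (pixel coordinates plus text); neither GLM-OCR nor VisionKit emits this shape directly, and the 0.5 IoU threshold is an arbitrary starting point.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    # Hypothetical normalized record: box corners in page pixels plus text.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

def iou(a: Region, b: Region) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union

def disagreements(engine_a, engine_b, min_iou=0.5):
    """Pair each region from engine A with its best-overlapping region from B.

    Returns (a_region, b_region) where the texts differ, and
    (a_region, None) where B has no sufficiently overlapping region.
    """
    flagged = []
    for ra in engine_a:
        best = max(engine_b, key=lambda rb: iou(ra, rb), default=None)
        if best is None or iou(ra, best) < min_iou:
            flagged.append((ra, None))
        elif ra.text.strip() != best.text.strip():
            flagged.append((ra, best))
    return flagged
```

The flagged pairs are the ones worth looking at in a visual diff; regions where both engines agree can mostly be skipped.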

This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling for my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would previously have believed possible.

[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing, which can make some combined areas "disappear" from the layout data.
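A minimal sketch of the "keep the raw results" idea, under my own assumptions: regions are plain dicts with a "bbox" of [x0, y0, x1, y1], raw output is appended to a JSON Lines file, and a raw region counts as "disappeared" when no post-processed region covers at least 80% of its area. None of these names or formats come from the GLM repo; they are placeholders for whatever your pipeline produces.

```python
import json
from pathlib import Path

def save_raw(path, page_id, regions):
    """Append one page of untouched model output to a JSON Lines file."""
    record = {"page": page_id, "regions": regions}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def coverage(raw_box, other_box):
    """Fraction of raw_box's area that lies inside other_box."""
    x0, y0, x1, y1 = raw_box
    ox0, oy0, ox1, oy1 = other_box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    iw = min(x1, ox1) - max(x0, ox0)
    ih = min(y1, oy1) - max(y0, oy0)
    if iw <= 0 or ih <= 0:
        return 0.0
    return (iw * ih) / area

def missing_regions(raw, processed, min_cover=0.8):
    """Raw regions that no post-processed region covers well enough."""
    return [
        r for r in raw
        if all(coverage(r["bbox"], p["bbox"]) < min_cover for p in processed)
    ]
```

Running missing_regions on each page after post-processing gives a short list to inspect by eye, instead of diffing the whole layout.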