I do OCR of images, and that's exactly what I do. I take one big image and slice it into many smaller ones, and send those to the LLM. Perfect every time, unlike using the whole image which resulted in hot garbage.

It works with relatively good scans. With bad or skewed scans, and especially with documents full of label/value pairs that aren't nicely tucked inside sentences, the more context you have, the better your chances of finding the correct words and fixing the errors.

There is a whole class of tricky documents. A decent (if you ignore the marketing bias) post about this problem can be found here:

https://getomni.ai/blog/ocr-benchmark

How do you know where to slice an image? What if you slice an image mid-word?

I calculate* the appropriate overlap, and the slicer makes each slice repeat that much of the previous one. There is some post-processing assembly required, but it's trivial.

[*] SWAG line height, trial and error to figure out the right amount of overlap given LLM error rates, etc.
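For anyone who wants to try this, here is a rough sketch of the idea in Python, not my exact code. The names slice_bounds and merge_lines are made up for illustration, and it assumes horizontal strips, an overlap of at least one text line, and an LLM that returns the overlapping lines identically in both slices. In practice you would probably want fuzzy matching instead of exact equality.

```python
def slice_bounds(image_height, slice_height, overlap):
    """Return (top, bottom) pixel rows for horizontal strips.

    Each strip repeats `overlap` rows of the previous one, so a text
    line cut by one slice boundary appears whole in the next slice.
    """
    if not 0 <= overlap < slice_height:
        raise ValueError("overlap must be >= 0 and smaller than slice_height")
    step = slice_height - overlap
    bounds = []
    top = 0
    while True:
        bottom = min(top + slice_height, image_height)
        bounds.append((top, bottom))
        if bottom >= image_height:
            break
        top += step
    return bounds


def merge_lines(previous, current):
    """Join two lists of OCR'd text lines, dropping the duplicated overlap.

    Assumption: the duplicated lines match exactly. Looks for the longest
    run that ends `previous` and also starts `current`.
    """
    max_k = min(len(previous), len(current))
    for k in range(max_k, 0, -1):
        if previous[-k:] == current[:k]:
            return previous + current[k:]
    # No shared lines found: just concatenate.
    return previous + current


if __name__ == "__main__":
    # A 1000 px tall page, 400 px slices, 100 px of overlap.
    print(slice_bounds(1000, 400, 100))
    print(merge_lines(["Name: A", "Date: B"], ["Date: B", "Total: C"]))
```

Each (top, bottom) pair can be fed to whatever crop call your image library has; the OCR output per slice is then split into lines and folded together with merge_lines.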

Interesting. Do you have a uniform data set, e.g. documents of a specific type that you know consistently have similar formats, or is this tuning something you need to do per-document?