OCR is fine for books, which are just swathes of text, but for things like magazines it breaks down badly. You have columns breaking in weird places and running up, down, left, and right, and fonts changing in the middle of a paragraph. On top of that, the pages are heavy on images that the text often references, either explicitly or implicitly. Without the images, the meaning of the text is often changed or lost.
Anyone have an LLM that can take a 300-page PDF magazine (with no OCR) and summarize it? :)