How about using something like Apache Tika to extract text from the documents? It began as a Lucene subproject (it is now a top-level Apache project) and consists of a proxy parser that delegates to format-specific parsers for a number of document formats. If a document, e.g. a PDF, comes from a scanner, Tika can optionally shell out to Tesseract and perform OCR for you.
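To make the "proxy parser + delegates" idea concrete, here is a minimal, self-contained sketch of the pattern in plain Java. It is not Tika's API: `FormatParser`, `ProxyParser`, and `demoParser` are hypothetical names made up for illustration, and the delegate is chosen by file extension only, whereas Tika's own `AutoDetectParser` detects the type from the content as well.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DelegatingParserSketch {

    /** Hypothetical interface: one implementation per document format. */
    interface FormatParser {
        String extractText(byte[] content);
    }

    /** Hypothetical proxy: looks up a delegate by file extension and forwards the call. */
    static class ProxyParser {
        private final Map<String, FormatParser> delegates = new LinkedHashMap<>();

        void register(String extension, FormatParser parser) {
            delegates.put(extension.toLowerCase(Locale.ROOT), parser);
        }

        String parse(String fileName, byte[] content) {
            int dot = fileName.lastIndexOf('.');
            String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            FormatParser delegate = delegates.get(ext);
            if (delegate == null) {
                throw new IllegalArgumentException("No parser registered for: " + fileName);
            }
            return delegate.extractText(content);
        }
    }

    /** Builds a proxy with two toy delegates; real delegates would wrap format libraries. */
    static ProxyParser demoParser() {
        ProxyParser proxy = new ProxyParser();
        // Plain text: decode the bytes as UTF-8.
        proxy.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // CSV: decode, then turn the commas into spaces.
        proxy.register("csv", bytes -> new String(bytes, StandardCharsets.UTF_8).replace(',', ' '));
        return proxy;
    }

    public static void main(String[] args) {
        ProxyParser proxy = demoParser();
        byte[] data = "a,b,c".getBytes(StandardCharsets.UTF_8);
        System.out.println(proxy.parse("table.csv", data));
        System.out.println(proxy.parse("notes.txt", data));
    }
}
```

The caller only ever talks to the proxy, so supporting another format means registering one more delegate; an OCR step for scanned input would sit behind the same interface.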
Tika's documentation is abysmal. It may be a great product, but we had to scrap it because of this.