Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
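
Something like this is what I have in mind, as a rough sketch rather than a tested pipeline. It assumes pandoc is on your PATH and that the input is a format pandoc can read (e.g. .docx or .odt); for formats that need LibreOffice, the equivalent first step would be `unoconv -f txt yourfile`. The function names are made up for illustration, and `clean_with_llm` is a hypothetical placeholder: it only does trivial whitespace cleanup here, and you would swap in a call to whatever model you actually use.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(src: str, dst: str) -> list[str]:
    """Build the pandoc invocation that writes src out as plain text."""
    # -t plain selects pandoc's plain-text writer; -o names the output file.
    return ["pandoc", src, "-t", "plain", "-o", dst]


def clean_with_llm(text: str) -> str:
    """Hypothetical stand-in for the LLM cleanup pass (assumption, not a real API).

    It only strips trailing spaces and collapses runs of blank lines,
    so the sketch runs without any model or API key.
    """
    cleaned: list[str] = []
    for line in (raw.rstrip() for raw in text.splitlines()):
        # Skip a blank line if the previous kept line was also blank.
        if line == "" and cleaned and cleaned[-1] == "":
            continue
        cleaned.append(line)
    return "\n".join(cleaned).strip()


def convert_to_clean_text(src: str) -> str:
    """Convert src to plain text with pandoc, then run the cleanup pass."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    # Write the plain text next to the source, with a .txt suffix.
    dst = str(Path(src).with_suffix(".txt"))
    subprocess.run(build_pandoc_command(src, dst), check=True)
    raw = Path(dst).read_text(encoding="utf-8")
    return clean_with_llm(raw)
```

You would call `convert_to_clean_text("report.docx")` (the file name is just an example) and get the cleaned text back as a string. Keeping the conversion and the cleanup as separate functions means you can inspect the raw pandoc output before deciding how much cleanup the LLM really needs to do.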