I have been looking for something that would ingest a decade of old Word and PowerPoint documents and convert them into a standardized format where the individual elements could be repurposed for other formats. This seems like a critical building block for a system that would accomplish this task.

Now I need a catalog, archive, or historian function that stores the elements and makes them easy to pull back out. Amazing work!

Can't you just start with unoconv or pandoc, then maybe use an LLM to clean up after converting to plain text?
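A rough sketch of the first half of that idea, assuming pandoc is installed and on your PATH. The function names here are made up for illustration. It only covers DOCX: as far as I know pandoc has traditionally only written PPTX rather than read it, so slides would probably need to go through unoconv/LibreOffice first. The LLM cleanup step is left out because it depends entirely on which model and API you use.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src: Path, dst: Path) -> list[str]:
    """Build the pandoc invocation for DOCX -> plain text (hypothetical helper)."""
    # --from, --to and --output are standard pandoc options.
    return ["pandoc", "--from", "docx", "--to", "plain",
            "--output", str(dst), str(src)]


def convert_docx_to_text(src: Path, dst: Path) -> None:
    """Run pandoc on one file; raises if pandoc is missing or the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, dst), check=True)
```

For a whole archive you would loop over `Path("archive").rglob("*.docx")` and call `convert_docx_to_text` on each file. Plain text throws away the structure (headings, tables, images), so a structured target such as pandoc's `markdown` or `json` output may be a better fit if the goal is to reuse individual elements.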

Which decade? DOCX and PPTX are just zipped XML, which seems pretty standard to me.
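To show what "zipped XML" means in practice, here is a minimal sketch using only the Python standard library. It opens a DOCX as a zip archive, lists its parts, and pulls paragraph text out of `word/document.xml`. The helper names and the tiny in-memory sample are made up for illustration, and the sample is not a complete file that Word would open. This only applies to the 2007+ formats; the older .doc and .ppt files are binary, not zipped XML, which is why the decade matters.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_parts(docx) -> list[str]:
    """Return the names of the files stored inside the DOCX zip container."""
    with zipfile.ZipFile(docx) as zf:
        return zf.namelist()


def paragraph_texts(docx) -> list[str]:
    """Return the text of each <w:p> paragraph in word/document.xml."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_sample_docx() -> io.BytesIO:
    """Build a stripped-down, illustrative DOCX-like zip in memory."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = make_sample_docx()
    print(list_parts(sample))
    print(paragraph_texts(sample))
```

Getting at the XML is the easy part. Real documents spread content across styles, numbering, relationships, and embedded media parts, and paragraphs nested in tables or text boxes need more careful handling than this flat walk, so turning the XML into reusable elements is still real work.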