Hacker News

https://github.com/johnwatson11218/LatentTopicExplorer/

This tool reads in a collection of PDF files and uses AI to cluster the documents by semantic content. The pipeline: pdfplumber -> PostgreSQL -> one document to many pages -> pages embedded using sentence_transformers -> 384-dimensional embeddings reduced to 2D for plotting using UMAP -> clusters identified using HDBSCAN -> labels identified using class-based TF-IDF implemented inside PostgreSQL using stored procedures. Once the files have been read in, the system is more stable and reliable. My personal document collection is about 1,200 files, and ingestion takes 12 to 24 hours on my various laptops. Once that is done, the website is very fast.
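
For anyone unfamiliar with the labeling step, here is a minimal Python sketch of class-based TF-IDF. It is not the project's code (that lives in PostgreSQL stored procedures); it assumes one common formulation, where each cluster's pages are merged into a single bag of words and a term is weighted by its in-cluster frequency times log(1 + A / f_t), with A the average word count per cluster and f_t the term's count across all clusters. The function name, the tokenizer, the tiny stopword list, and the sample pages are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for the toy example below.
STOPWORDS = frozenset({"the", "on", "a", "of", "and"})


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def class_tfidf(pages, labels, top_n=3):
    """Return the top_n highest-weighted terms for each cluster label."""
    # Merge every page in a cluster into one bag of words per cluster.
    class_counts = {}
    for text, label in zip(pages, labels):
        if label == -1:  # HDBSCAN uses -1 for noise points; skip them
            continue
        class_counts.setdefault(label, Counter()).update(tokenize(text))

    # f_t: count of each term over all clusters; A: average words per cluster.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    top_terms = {}
    for label, counts in class_counts.items():
        class_size = sum(counts.values())
        scores = {
            term: (n / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, n in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top_terms[label] = ranked[:top_n]
    return top_terms


# Toy input: four "pages" already assigned to two clusters.
pages = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the stock market fell",
    "the market rallied on stock news",
]
labels = [0, 0, 1, 1]
topics = class_tfidf(pages, labels)
print(topics)
```

Terms that are frequent inside one cluster but rare elsewhere float to the top, which is what makes them usable as cluster labels. Doing the same aggregation in SQL keeps the counts next to the page table, which is presumably why the project does it inside the database.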