My YOShInOn RSS reader uses an SBERT model for classification (will I upvote this or not?) and large-scale clustering (20 k-means clusters and show me the top N in each cluster so I get a diversity of different articles.)

For duplicate detection I am using DBSCAN

https://scikit-learn.org/stable/modules/generated/sklearn.cl...

and found some parameters where I get almost no false positives but a lot of duplicates get missed when I lowered the threshold to make clusters I started getting false positives fast. I don't find duplicates are a big problem in my system with the 110 feeds I have and the subjects I am interested in, but insofar as they are a problem there tend to be structured relationships between articles: that is, site A syndicates articles from site B but for some reason articles from site A usually get selected and site B articles don't. An article from Site A often links to one or more articles, often that I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.

Effective clustering is the really interesting technology Google News has had for a long time.

I have been attempting this exact sort of clustering solution for a few years now (on and off as a side project). Do you have source code available, or more detailed explanations/resources of how to approach this?

Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find a number of references it looks like you've made to it on various forums, etc over the years.

The technical report on YOShInOn is about 2 years overdue!

You mean the k-means for diversity or DBSCAN for duplicates? Either way it is about 10 lines of scikit-learn code. Send me an email.

Both. Just sent an email. Thanks!