Hacker News

many people tend to overlook how little information is needed for successful de-anonymization.

i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):

"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."

and that was almost 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside massive growth in the technology that enables and enhances them.
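To make the "only a little bit" part concrete, here is a minimal sketch of matching a few known ratings against a sparse dataset. It is loosely inspired by the idea in the paper (rare items count for more, and a match is only accepted if it stands out from the runner-up), but it is not the paper's exact algorithm. The function name `best_match`, the tolerance of one star, and the `min_gap` threshold are all assumptions made up for illustration.

```python
import math
import statistics
from collections import Counter

def best_match(aux, records, min_gap=1.5):
    """Return the id of the record that best matches `aux`, or None if ambiguous.

    aux:     {movie: rating} the adversary happens to know (hypothetical input)
    records: {record_id: {movie: rating}} the "anonymized" dataset
    min_gap: assumed threshold, in standard deviations, between best and runner-up
    """
    # How many records rated each movie; rarely rated movies are more identifying.
    support = Counter(movie for rec in records.values() for movie in rec)

    def score(record):
        total = 0.0
        for movie, rating in aux.items():
            # Tolerate an error of one star in what the adversary knows.
            if movie in record and abs(record[movie] - rating) <= 1:
                total += 1.0 / math.log(1 + support[movie])
        return total

    scores = {rid: score(rec) for rid, rec in records.items()}
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        return None
    sigma = statistics.pstdev(list(scores.values()))
    # Only claim a match when the top score clearly stands out.
    if sigma == 0 or (ranked[0] - ranked[1]) / sigma < min_gap:
        return None
    return max(scores, key=scores.get)

records = {
    "a": {"Popular": 5, "Obscure1": 4, "Obscure2": 2},
    "b": {"Popular": 5, "Other": 3},
    "c": {"Popular": 4, "Other": 1},
    "d": {"Popular": 3},
}
print(best_match({"Popular": 5, "Obscure1": 4, "Obscure2": 3}, records))
print(best_match({"Popular": 5}, records))
```

In this toy data, knowing one popular rating is not enough to single anyone out, while two roughly remembered ratings of obscure movies are.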

i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and i'm not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.

DalasNoin 11 hours ago

That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5. We also try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see if LLMs are better than the Narayanan approach purely on movie reviews. LLMs are still much better (getting about 8%, but the average person only had 2.5 movies and 48% only shared one movie, so very difficult to match).

john_strinlai 10 hours ago

>we make a pretty direct comparison in section 5

awesome, i saw the mention in the introduction but i haven't yet had a chance for a thorough read through of the paper -- i've just skimmed it. looking forward to reading it in-depth!

Jerrrrrrrry 9 hours ago

Throwaway accounts using "clever" turns of phrase can often be de-anonymized by double-clicking, right-clicking -> googling their witty pun and seeing the sole other instance elsewhere, on Twitter, Facebook, etc.

If I see a couple of words I don't know in a row, I can infer a poster's real name.

I'd be more specific, but any example would be doxxing, literally.

SchemaLoad 6 hours ago

If you have access to the whole site dataset it's much more reliable with simpler checks. You can just use word usage frequency of common words. Someone posted a demo here of doing this to HN comments which was very effective at showing alt accounts for a user.
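A rough sketch of what a common-word frequency check could look like is below. This is not the demo mentioned above, just an illustration under assumptions: the `FUNCTION_WORDS` list, the helper names, and the use of cosine similarity are all choices made for this example, and a real attempt would likely use a much longer word list and far more text per account.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny list of common words; real lists are much longer.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it",
                  "is", "i", "but", "just", "so", "which", "really"]

def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_accounts(target, comments_by_user, top=3):
    """Rank other accounts by how similar their word usage is to `target`."""
    target_profile = profile(" ".join(comments_by_user[target]))
    scored = [
        (cosine(target_profile, profile(" ".join(comments))), user)
        for user, comments in comments_by_user.items()
        if user != target
    ]
    return sorted(scored, reverse=True)[:top]

# Hypothetical accounts and comments, invented for the example.
comments_by_user = {
    "main": ["i just think it is really simple, but i just do not know"],
    "alt": ["i really just want it, but i just cannot"],
    "other": ["the state of the art in the field of the study of the mind"],
}
for similarity, user in closest_accounts("main", comments_by_user):
    print(user, round(similarity, 3))
```

Common words are used here rather than rare ones because everyone uses them constantly and mostly without thinking, so their relative rates are hard to disguise.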

plagiarist 6 hours ago

I assume one's vocabulary is basically a fingerprint, even if one doesn't use unique turns of phrase. Domain knowledge just leaks in and we aren't conscious of it being identifiable.