People have always claimed this as a data leak vector but I've always been sceptical. Like, writing style and vocabulary alone are probably shared among too many people to narrow it down much. (How many people that you know could have written this reply?) The counter argument is that he had a very specific style in his mail so maybe this is a special case.
If you have a large enough set to test against and a specific person you are looking for, this is totally doable currently.
Of course it's doable. The question is how reliable the results are.
I wonder if it works on zoomers too. I have noticed a slight mode collapse among this population ;)
It just needs to find the needles in the haystack. Humans can better verify if they're truly needles.
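A rough sketch of that "shortlist for a human to check" idea, assuming you already have a dict of known writing per candidate. Character-trigram cosine similarity is just one simple choice I picked for illustration, and `trigram_profile`, `similarity` and `shortlist` are made-up names, not anyone's actual tooling:

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams after normalizing case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def similarity(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(
        sum(c * c for c in q.values())
    )
    return dot / norm if norm else 0.0


def shortlist(unknown_text, known_texts, k=3):
    """Return the k most similar candidates as (author, score) pairs.

    known_texts maps a (hypothetical) author label to a sample of their writing.
    """
    target = trigram_profile(unknown_text)
    scored = [
        (similarity(target, trigram_profile(sample)), author)
        for author, sample in known_texts.items()
    ]
    scored.sort(reverse=True)
    return [(author, score) for score, author in scored[:k]]
```

The output is only a ranking. Whether the top hits are truly needles is still up to whoever reads them.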
Not just a test set, but enough of a set to search through and compare against. Several pages of in-depth writing isn't anywhere near sufficient, even when limiting the search space to ~10k people.
this is a well-studied field (stylometry). when combining writing styles, vocabulary, posting times, etc. you absolutely can narrow it down to specific people.
even when people deliberately try to feign some aspects (e.g. switching writing styles for different pseudonyms), they will almost always slip up and revert to their most comfortable style over time. which is great, because if they aren't also regularly changing pseudonyms (which are also subject to limited stylometry, so pseudonym creation should be somewhat randomized in name, location, etc.), you only need to catch them slipping once to get the whole history of that pseudonym (and potentially others, once that one is confirmed).
People do change over time, I used to write "ha" after every sentence for some reason
You know, I had a particularly cringy period in which I put "la" at the end of sentences.
Don't throw the baby out with the bathwater. "Ooh, la" sounds really unnatural.
But on a serious note, what did "la" mean in your context? I've never seen this.
It’s a common thing for speakers of Singaporean English to end sentences with la/leh. But no idea if that’s what’s going on here.
In one use case, it is kind of a verbal exclamation point, but it has more meanings and uses than just that. It likely originates from Hokkien, but it has evolved into its own thing. If you are curious, more details here: https://en.wikipedia.org/wiki/Singlish
In Turkish la at the end disrespectfully refers to a male person.
You left off something.
sure, not denying that. my writing style is fairly different now in my 40s than it was in my late teens/early twenties.
but, those changes are usually pretty gradual and relatively small. that's why when attempting to identify someone via writing, you look at several aspects of the writing (grammar, use of specific slang, sentence length, paragraph structure, punctuation, etc.) and not just word choice. it is highly unlikely that all aspects of someone's writing change at the same time. simply removing "ha" is inconsequential to identification if not much else changed.
additionally, this data is typically combined with other data/patterns (posting times, username (themes, length, etc.), writing that displays certain types of expertise, and more) to increase the confidence level of correct identification.
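to make "several aspects" concrete, here is a minimal sketch that turns a text into a small style vector (function-word rates, punctuation rates, sentence length, lowercase sentence starts) and compares two vectors. the feature lists and the names `style_features` and `cosine_similarity` are my own assumptions for illustration; real stylometry uses far more features, and the non-text signals (posting times, usernames) are not modelled here:

```python
import math
import re
from collections import Counter

# Small illustrative lists; these are assumptions, not a standard feature set.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]
PUNCTUATION = [",", ";", ":", "!", "?", "(", "-"]


def style_features(text):
    """Build a crude style vector from several independent aspects of a text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    counts = Counter(words)

    # Word choice: relative frequency of common function words.
    features = [counts[w] / n_words for w in FUNCTION_WORDS]
    # Punctuation habits: marks per word.
    features += [text.count(p) / n_words for p in PUNCTUATION]
    # Sentence length: mean words per sentence, scaled to roughly 0..1.
    features.append(n_words / n_sentences / 100)
    # Capitalization habit: share of sentences that start lowercase.
    features.append(sum(1 for s in sentences if s[0].islower()) / n_sentences)
    return features


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

dropping one habit (say, a trailing "ha") only nudges a single word-level component, while the punctuation, sentence-length and capitalization components stay put, which is the point about one change being inconsequential on its own.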
Stylometry is okay if you're trying to deanonymize a large enough sample text. A reddit account would be doable. But individual 4chan posts? You barely have enough content within the text limit.