Just to be clear, as with LAION, the data set doesn't contain personal data.
It contains links to personal data.
The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyright material. Folks can argue if that's true but the discussion should at least be transparent.
I think the data set is generally considered to consist of the images, not the list of links for downloading the images.
That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.
It does matter? When implemented as a reference, the image can be taken down and will no longer be included in training sets*. As a copy, the image is eternal. What’s the alternative?
* Assuming the users regularly check that the images are still being hosted (probably something that should be regulated)
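For what it's worth, the "regularly check" step in that footnote could look roughly like the Python sketch below. It is only an illustration under stated assumptions: entries are ("descriptive text", URL) tuples, and `prune_dead_links` and `http_is_live` are hypothetical names, not part of any real dataset tooling. The liveness check is passed in as a function so the pruning logic can be exercised without touching the network.

```python
import urllib.request
from typing import Callable

# Assumed entry format: (descriptive text, URL).
Entry = tuple[str, str]


def http_is_live(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical liveness check: True if a HEAD request ends in a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def prune_dead_links(
    entries: list[Entry], is_live: Callable[[str], bool]
) -> list[Entry]:
    """Keep only the entries whose URL still resolves according to is_live."""
    return [(caption, url) for caption, url in entries if is_live(url)]
```

Calling `prune_dead_links(entries, http_is_live)` before each training run would drop images that have been taken down at the source. It would not catch a URL that still resolves but now serves different content.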
The data set is a list of ("descriptive text", URL) tuples.
As with almost any URL, it is not in and of itself an image.
As an aside, this presents a problem for researchers because the links can resolve to different resources, or no resource at all, depending on when they are accessed.
Therefore this is not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
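To make the reproducibility point concrete, here is a minimal Python sketch. It is not how LAION or any other dataset is actually distributed. It assumes each entry is a ("descriptive text", URL) tuple and that whoever takes a snapshot records a SHA-256 digest of the bytes they saw. `pin`, `verify`, and the injected `fetch` callable are hypothetical names.

```python
import hashlib
from typing import Callable, Optional

# Assumed entry format: (descriptive text, URL).
Entry = tuple[str, str]
# fetch(url) returns the bytes currently served, or None if nothing resolves.
Fetch = Callable[[str], Optional[bytes]]


def pin(entries: list[Entry], fetch: Fetch) -> dict[str, str]:
    """Record a SHA-256 digest for every URL that currently resolves."""
    digests: dict[str, str] = {}
    for _caption, url in entries:
        body = fetch(url)
        if body is not None:
            digests[url] = hashlib.sha256(body).hexdigest()
    return digests


def verify(pinned: dict[str, str], fetch: Fetch) -> dict[str, str]:
    """Classify each pinned URL as 'same', 'changed', or 'missing'."""
    status: dict[str, str] = {}
    for url, digest in pinned.items():
        body = fetch(url)
        if body is None:
            status[url] = "missing"
        elif hashlib.sha256(body).hexdigest() == digest:
            status[url] = "same"
        else:
            status[url] = "changed"
    return status
```

Two researchers who run `verify` at different times against the same pinned digests can at least tell which entries drifted or disappeared, even though neither can recover the original bytes from the link list alone.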
I think you may be missing the point: The title says "AI training data set", which is the result of downloading the linked images. The list of tuples is just how this training dataset is distributed.
The issue in question is that many/most large generative AI models were trained with personal data.
Ah! Yes, thanks very much, I misread.
That's a distinction without a difference. Just as with LAION, anyone using this data set is going to be downloading the images and training on them, and the potential harms to the affected users are the same.
LAION was alleged to link to CSAM. If LAION didn't link and instead hosted/contained/distributed the actual files, I think there would be a much higher chance that someone distributing LAION could serve prison time, at least in the USA.
That seems like a pretty big difference to me.
When the model is trained, are the links not resolved to fetch whatever they point to, and that goes into the model?
Secondly, privacy and copyright are different. Privacy is more of a concern with how information is used than getting credit and monetization for being the author.
No, normally your training pipeline wouldn't involve running BitTorrent.
If the training set contained BitTorrent magnet links to the desired information (e.g. images whose pixels are to be trained on), then, yes, it would have to.
Upthread it was mentioned that the training data representation contained links to material; magnet links were mentioned in passing as an example of something supposedly not violating copyright. It wasn't stated that training data contained magnet links. (Did it?)
You sure about that? https://arstechnica.com/tech-policy/2025/07/meta-pirated-and...
Links to PII are by far the worst sort of PII, yes.
“It’s not his actual money, it’s just his bank account and routing number.”
A more accurate analogy is "it's not his actual money, it's a link to a webpage or image that has his bank account and routing number."
My contention is that links to PII are themselves PII.
A name, Jon Smith, is technically PII but not very specific. If I have a link to a specific Jon Smith’s Facebook page or his HN profile, it’s even more personally identifiable than knowing his name is Jon Smith.
That is crazy. The target of a link could change, which means that all links are Schroedinger's PII.
And if a link to PII is PII, then a link to a link to PII is PII, and thus all links are PII unless they link to the dark (unlinked) Web.
Well yes, by knowing my bank account and routing number, you don't have access to my money.
You do in the US.
That sounds insecure. Looks like blockchain and private keys with extra hops. Perhaps you can easily revert banking transactions...
> The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyright material.
I read the article as being about AI trained on personal data. That is a serious breach of many countries' legislation.
And AI is 100% being trained on copyrighted data too, which breaks a different set of laws.
That shows how much big tech is just breaking the law and using money and influence to get away with it.
"Ladies and gentlemen of the jury, my client did not rob that bank. He only made a Google Maps link to directions to the bank, a link to an Imgur image containing the vault's combination, and a link to a Pastebin with instructions on how to disable the security system available. He merely packaged that information together and made it publicly available in a single source in a format only really useful to robbers for the purpose of robbery training. It's twoo hward to actually look at the information one is compiling and releasing to the public and to expect even a microscopically minuscule cursory amount of minimal effort to that end is unreasonable. He is clearly innocent."
What do you think they’d be charged with in this situation?
It wouldn’t be bank robbery.