I think the data set is generally considered to consist of the images, not the list of links for downloading the images.

That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective), but for the question "Does that mean a model was trained on my images?" it's immaterial.

It does matter? When implemented as a reference, the image can be taken down and will no longer be included in training sets*. As a copy, the image is eternal. What’s the alternative?

* Assuming the users regularly check that the images are still being hosted (probably something that should be regulated)

The data set is a list of ("descriptive text", URL) tuples.

As with almost any URL, it is not in and of itself an image.

As an aside, this presents a problem for researchers because the links can resolve to different resources, or no resource at all, depending on when they are accessed.

Therefore this is not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
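To make this concrete, here is a minimal sketch (hypothetical, not any project's actual tooling) of what materializing such a tuple list involves. The function name and structure are my own; the point is that each URL must be fetched at training time, and any of them may have rotted since the list was published:

```python
import urllib.request
from urllib.error import URLError, HTTPError

def fetch_images(records):
    """records: iterable of (caption, url) tuples, as distributed.

    Returns (resolved, dead): resolved pairs each caption with the
    downloaded image bytes; dead collects tuples whose URL no longer
    resolves. Two runs at different times can return different splits,
    which is why the materialized dataset is not reproducible.
    """
    resolved, dead = [], []
    for caption, url in records:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resolved.append((caption, resp.read()))
        except (URLError, HTTPError, TimeoutError):
            # Link rot: the tuple survives, the image does not.
            dead.append((caption, url))
    return resolved, dead
```

Anything that lands in `dead` silently drops out of the next training run, which is exactly the reference-vs-copy trade-off discussed above.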

I think you may be missing the point: The title says "AI training data set", which is the result of downloading the linked images. The list of tuples is just how this training dataset is distributed.

The issue in question is that many/most large generative AI models were trained with personal data.

Ah! Yes, thanks very much, I misread.