No need for a data dump, just list all URLs or whatever else of their training data sources. Afaik that's how the LAION training dataset was published.
Providing a large list of bitrotted URLs and titles of books that the user would have to OCR themselves before attempting to reproduce the model doesn't seem very useful.
Aren't the datasets mostly shared in torrents? They probably won't bitrot for some time.
...no? They also use web crawlers.
The datasets are collected using web crawlers, but that doesn’t tell us anything about how they are stored and re-distributed, right?
Why would you store the data after training?
Are you saying that you know they don’t store the data after training?
I'd just assume they did, because why scrape again if you want to train a new model? But if you know otherwise, I'm not tied to this idea.
I'm also assuming. But I would ask the opposite question: why store all that data if you'll have to scrape again anyway?
You will have to scrape again because you want the next model to be trained on updated data. And even at the scale needed to train an LLM, storing all of the text on the entire known internet is a very non-trivial task!
If you try to reproduce various open datasets like fineweb by scraping the pages again, you can't, because a lot of the pages no longer exist. That's why you would prefer to store them instead of losing the content forever.
It's not "all of the text"; it's something like less than 100 trillion tokens, which works out to under 400TB at 4 bytes per token, even if you don't bother running the token streams through a general-purpose compression algorithm before storing them.
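A rough back-of-envelope sketch of that estimate, assuming token IDs are stored as fixed-width 4-byte integers with no compression (the byte width and token count are assumptions for illustration, not figures from any specific lab):

    # Back-of-envelope storage estimate for a raw token dump.
    # Assumptions: ~100 trillion tokens, each token ID stored as a
    # fixed-width 4-byte integer (e.g. uint32), no compression applied.
    TOKENS = 100e12          # ~100 trillion tokens
    BYTES_PER_TOKEN = 4      # uint16 IDs would roughly halve this

    total_bytes = TOKENS * BYTES_PER_TOKEN
    print(f"{total_bytes / 1e12:.0f} TB uncompressed")  # -> 400 TB

Using 2-byte token IDs (fine for vocabularies under 65k) or any general-purpose compressor would push the figure well below 400TB.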