Hacker News

Given that #1 seems to be based almost entirely on stealing from #2, and never paying reparations, I’d say it’s pretty unsustainable.

It’s like saying robbing banks for a living isn’t sustainable and working at a bank is. That’s not exactly a stretch.

fc417fc802 3 hours ago

#1 may well put #2 out of a living but that isn't the same as stealing and doesn't (at least in and of itself) make it unsustainable. The fact that models were trained on scraped content isn't a matter of technical necessity but rather the path of least resistance (lowest cost in this case). Synthetic data is increasingly used for reasons of quantity, quality, and various technical considerations.

tw04 3 hours ago

All of the current major players in AI literally stole to build their models. There isn’t one out there that hasn’t. So yes, it is the same as stealing because they were LITERALLY, in the literal sense, stealing.

fc417fc802 2 hours ago

Well, pirated. Piracy and stealing aren't the same thing.

Regardless, I acknowledged the general issue. However, I pointed out that doing so was not a technical necessity. If you base your worldview or actions around X implying Y, but it then turns out that Y was merely a matter of convenience, you're probably going to arrive at a wrong conclusion.

There's also the issue that you're emphatically calling it stealing without providing clear criteria. The legal system as a whole has yet to conclusively resolve the various piracy accusations. The legality of consuming publicly available content remains quite controversial.

tw04 2 hours ago

It absolutely is a technical necessity. You couldn't build a model from scratch today without doing the same thing. And every model attempting to train on AI generated output degrades into nonsense almost immediately.

There’s a reason Reddit is making millions of dollars letting these companies mine their human generated content. You think OpenAI or anyone else would pay for that if they could just cyclically train on AI generated content???

fc417fc802 2 hours ago

> attempting to train on AI generated output

I said nothing about that. Good synthetic data does not (typically) involve ML algorithms. Although that might be changing.

I'll politely suggest that you go read the literature before engaging further.

Reddit, Twitter, and similar are valuable because the data covers current events. Their content makes up a reasonably comprehensive timeline of the world at large. You don't need that to train a barebones functional model but it's certainly useful in order to train a knowledgeable one. Regardless, if they're charging for access it clearly isn't piracy so it doesn't seem like your original objection would hold any water in that case.

tw04 an hour ago

> I'll politely suggest that you go read the literature before engaging further.

Which commercial AI vendor has not stolen any content when creating their models? I’ll wait.

Which commercial AI vendor has created their models by training exclusively on datasets created by other AI?

> Regardless, if they're charging for access it clearly isn't piracy so it doesn't seem like your original objection would hold any water in that case.

Given that they were previously violating the site’s terms of service when scraping the content: yes, they were absolutely stealing.