Training on private user interactions is a privacy violation; training on public, published texts is (some argue) an intellectual property violation. They're very different kinds of moral rights.
Has Anthropic ever written clearly about exactly which training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are pretty tight-lipped about this, so I think it's safe to assume they've slurped up all the data they've come across via multiple methodologies, "private" or not.
Look at the suits against them; they list it there.
Are there complete lists in the suits? Last time I skimmed them, they contained allegations about sources and some admissions: The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes, and some others I don't remember. But AFAIK there isn't any complete inventory of the training datasets they used.
I wish I could be optimistic enough to believe there's no private information published, unintentionally or maliciously, on the open web where crawlers can find it.
(And as diggan said, the web isn't the only source they use anyway; who knows what they're buying from data brokers.)