Have Anthropic ever written clearly exactly about what training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are kind of tightly lipped about this, so I think it's safe to assume they've slurped up all data they've come across via multiple methodologies, "private" or not.

[deleted]

Look at the suits against them they list it there

Are there complete lists in the suits? Last time I skimmed them, they contained allegations of sources, and some admissions like The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes and some other sources I don't remember, but AFAIK there isn't any complete inventory of training datasets they used.