Have Anthropic ever written clearly exactly about what training datasets they use? Like a list of everything included? AFAIK, all the providers/labs are kind of tightly lipped about this, so I think it's safe to assume they've slurped up all data they've come across via multiple methodologies, "private" or not.
Look at the suits against them they list it there
Are there complete lists in the suits? Last time I skimmed them, they contained allegations of sources, and some admissions like The Pile, LibGen, Books3, PiLiMi, scanned books, web scrapes and some other sources I don't remember, but AFAIK there isn't any complete inventory of training datasets they used.