The terms of service specifically prohibit this.

How much of the training set comes from websites with "no automated scraping" in their terms?

The companies stole that data from the world, so I don't see why we couldn't take it back.

It's a nice sentiment. The companies with the integrations are the ones that could take it back, but they don't have the incentive to break legal agreements and share with the world.

Meanwhile, the creative output of humanity is distilled into black boxes that benefit whoever can scrape the most and burn the most power. The cost is spread across everyone, so again there's little incentive among those who could create (likely legal) change.