You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the perspective of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them might in theory be well-behaved, the payoff from trying to account for that is too small to bother with. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
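If it helps to make that flip concrete, here's a toy sketch in Python (purely illustrative; the agent names are made up):

    # Hypothetical request filter showing the policy flip.
    KNOWN_BAD = {"rhinoceros-bot/9000"}          # always incomplete by nature
    ALLOWED_AGENTS = {"friendly-archiver/1.0"}   # the "special permit" holders

    def allow_by_default(user_agent: str) -> bool:
        # Old posture: serve everyone, chase an ever-growing blocklist.
        return user_agent not in KNOWN_BAD

    def block_by_default(user_agent: str) -> bool:
        # New posture: refuse everyone except the vetted few.
        return user_agent in ALLOWED_AGENTS

The blocklist has to enumerate every misbehaving rhino; the allowlist only has to enumerate the handful you actually trust.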
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
A rhino can't not be huge and destructive, and humans can't not be shitty and selfish. Badly behaved scrapers are simply an inevitable fact of the universe, and there's no point trying to do anything because it's an immutable law of reality that can never be changed, so don't bother to try.
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably to contact the ISP. Don't complain about scraping; complain about the DDoS (which is the actual problem, and which I'm increasingly beginning to believe is the intent).
Great! How do I get, say, Google's ISP to disconnect them?
Every ISP has an abuse email contact you can look up.
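You can even script the lookup. Here's a rough sketch using RDAP (the structured successor to whois) via the public rdap.org redirector; the entity-walking is a best-effort assumption about how registries lay out their data, so treat it as a starting point rather than a definitive tool:

    import json
    import urllib.request

    def abuse_contacts(ip: str) -> list[str]:
        # rdap.org redirects to the RDAP server of the registry that owns the IP.
        with urllib.request.urlopen(f"https://rdap.org/ip/{ip}") as resp:
            data = json.load(resp)
        emails = []
        # Entities can be nested; walk them all looking for the "abuse" role.
        stack = list(data.get("entities", []))
        while stack:
            ent = stack.pop()
            stack.extend(ent.get("entities", []))
            if "abuse" in ent.get("roles", []):
                for prop in ent.get("vcardArray", [None, []])[1]:
                    if prop[0] == "email":
                        emails.append(prop[3])
        return emails

    print(abuse_contacts("203.0.113.7"))  # documentation IP; substitute the offender's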
Sure, let me just contact that one ISP located in Russia or India. I'm sure they will care a lot about my self-hosted blog.
Hence the need for Cloudflare?
I am not comfortable with a private company being the only solution, especially when they have a history of deplatforming sites.
Except that's exactly what you should do. And if they refuse to cooperate you contact the network operators between them and yourself.
Imagine if Chinese or Russian criminal gangs started sending mail bombs to the US/EU and our solution were to require all senders, including domestic ones, to prove their identity in order to have their parcels delivered. Completely absurd, but somehow with the Internet everyone jumps to that instead of more reasonable solutions.
The internet is not a mirror of the real world.
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers; it's about letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I would not want to admit that I ask Copilot how to use Windows Narrator, or ask Junie where the CSS that makes this text bold lives, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
People were okay with it when Google was sending them traffic. Now Google often doesn't. Google has broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
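And that opt-out mechanism still exists; the complaint is precisely that the new crawlers ignore it. For reference, here's what declining the documented AI training crawlers looks like in robots.txt (user-agent tokens per each vendor's published docs; verify them before relying on this, since the list changes):

    # Decline the major documented AI-training crawlers.
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

Of course, a robots.txt only works on crawlers that choose to read it, which is the whole point of this thread.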
There's something more obviously nefarious and existential about AI. It takes the idea of "you are the product" to a whole new level.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it later, for reasons that are unfolding now but that they won't become fully aware of until later.
Sure. But we're already talking about a presumption of free and open here. I'm sure people are also reading my words and incorporating them into their own for-profit work. If I cared, I wouldn't make it free and open in the first place.
But that is not something you can protect against with technical means. At best you can block the little fish and give even more power to the megacorporations, who will always have a way to get at the data: either by operating crawlers you cannot afford to block, by incentivizing users to run browsers and/or extensions that collect the data, or by buying the data from someone who does.
All you end up doing is participating in the enshittification of the web for the rest of us.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies, which use every possible trick to bypass every block, restriction, or rate limit you put in front of them, are a completely new problem on a scale we've never seen before.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (e.g. the freedom) of another legal entity begin. When I don't want "AI" (not just the badly-behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way as it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you are getting access to truly free websites whose owners are interested in having smart chatbots trained on the free web, but you are blocking said chatbots while touting a "free Internet" message.
AI is one of those bad actors