The problem is: what is the alternative? I'm not defending them or this practice by any measure, but we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system. I've hated CAPTCHAs ever since I first encountered them and I can't wait for them to just finally die a permanent death, but I also don't know how we solve the "how do you tell a human from a bot" problem in a way which doesn't require server admins to have extremely beefy servers or similar setups to handle the extra load. I'm not going to do the "there HAS to be a way" thing either because, for all I know, this could just be one of those impossible-to-solve problems.
> we all know what happens if you just open your site up without these, especially with AI bots which hammer servers and are in effect a legalized DDoS system
No, we don't know. I honestly do not understand the problem. I run websites, both static and non-static. Granted, my sites aren't exactly the most popular internet go-to destinations, but I should be seeing this DDoS too, right?
I do see lots of requests. Nothing that any modern system can't handle. Computers are stupid fast these days. Unless you are doing something unreasonable, it's really hard to even notice this "extra load".
I understand there are sites for whom this causes problems, but I think these are rare and could be optimized not to do unreasonable things.
I think too many people are annoyed by AI companies (an arguably understandable position), look at their logs and speak of "hammering", "DDoS" and "extra load", while in reality it doesn't matter much.
I second this. My website exposes a cgit and 99% of the traffic now is AI scraping the sources, but the load is nowhere near DoS territory. And this is running on the cheapest VPS I could find.
Not saying I'm not annoyed by the scraping; I am looking to block them, but I'm also not going to put the site behind the gatekeeper. If anything, Cloudflare must love AI scraping now for the same reason AV companies love malware.
Now, if you are running a PHP stack...yeah, maybe that's the problem right there.
We do know, just ask anyone who runs a more popular site or does anything where abuse can be monetized (shopping, reviews, etc.). Avoiding that due to obscurity isn’t an answer because it’s saying you’re safe until something, possibly outside of your control, causes the bots to descend and give you an extra 500M requests with no chance of revenue.
I’m with OP: I don’t like this but the alternatives all look like the death of the open web.
> just ask anyone who runs a more popular site
The person you're responding to already said they ran a modestly sized site. What actual scale opens one up to abuse? If only the top 1% of sites need it, then it seems silly to say "everyone" needs it.
It might depend on the tech stack. I run a small niche website but it has PHP and a database (MediaWiki/phpBB) and without Cloudflare I'd estimate I'd need to spend several hundred dollars a month to handle the traffic. Traffic used to be tens of thousands of requests a day. AI has increased that to between 400k and 3M requests per day, but it's not a smooth distribution. This is with Bot Fight Mode on, which greatly reduces traffic.
I adopted Cloudflare because it was getting DDoSed by the AI crawlers. I'm pretty sure all of them are vibe coding their crawlers and don't bother adding rate limiting as a requirement.
Has anyone pointed an AI scraper at your server at all? Unless your website appears in search engine listings I don't think the AI scrapers will slam it. My server has never been hit by them but my server is also practically unknown. All of this said, I'm not going to claim that server loads can handle it because many sysadmins have claimed otherwise, and I would like to think that their claims are reliable.
As soon as you get your TLS certificate you get bombarded with scraping. You don't need someone to "point a scraper at you".
What matters most is usually how much there is to scrape. If you have like 5 pages, that's nothing. Forum-like websites, where each thread, each user profile, etc. gets scraped, are where traffic increases. I just let them have at it with no issues though, computers are fast.
Also, how do we even know they're really "AI scrapers", or just a deliberate DDoS to push sites into using CF or other "anti-bot" providers?
A small, non-static e-commerce site focused on a single EU country, with proper robots.txt instructions (which worked perfectly well in the "era" of search & co bots only) and rate limiting on an nginx/php-fpm setup, is kinda struggling without CF to handle 15000 requests per 15 minutes, coming from Chrome "users" on IPv6. Best so far was an avg. server load in htop = 40 on an 8-core server x_x
Block out IPv6 and see if that helps.
> handle 15000 requests per 15 minutes,
that's just ~17 req/sec
That's "cheap VPS running wordpress" level of traffic
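For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. The helper name `requests_per_second` is made up for illustration. Note that this is only the average over the window; it says nothing about bursts or about how expensive each request is to serve.

```python
def requests_per_second(requests: int, window_seconds: float) -> float:
    """Average request rate over a window (hypothetical helper, not from any library)."""
    return requests / window_seconds


# 15000 requests per 15 minutes, the figure quoted above
rate = requests_per_second(15_000, 15 * 60)
print(f"{rate:.1f} req/s")  # prints "16.7 req/s"
```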
You get downvoted for these opinions but I agree. Most people that complain that their servers get hammered by AI bots are those that run very unoptimized servers that can only handle like 100 rps. I've never had any issues with any of my moderately optimized websites. A $10 VPS can handle sooo much traffic.
I think people get annoyed when it's suggested they spend time optimising or even re-writing their websites to handle high traffic loads just to cater to AI bots ripping their content.
It's also not always easy to do. I run a small wiki which is fairly optimised, nearly every page manages at least ~3k rps on a small VPS. The only exception is the diff page which is ~150 rps. Optimising that while still giving good output isn't that easy, but the wiki doesn't have many users so that would be fine if it wasn't for the AI bots.
The AI bots ignore robots.txt and were initially hitting the site with ~1k rps crawling every combination. Even that would be manageable as there's currently ~150,000 combinations, except they kept re-crawling the whole lot each day. The server could manage it but it was a massive waste of resources.
They were using residential IPs and only sending 1 request from each IP, making it impossible to block. In the end I gave up and put a Cloudflare challenge in front of it. I don't want to use Cloudflare but the alternative is forcing users to log in to view diffs or removing them entirely.
Curious, but how do the bots figure out the combinations? Or do you have links to the diffs from other sites? I assume the diff takes two files in query parameters or something.
I don't think it's just privacy, it also increasingly turns the web itself into a walled garden. The end result is that websites can only ever be accessed by "approved" clients - the latest Chrome, Edge, Safari and if you're lucky Firefox - and nothing else.
The most plausible near-term path is probably micropayments embedded invisibly in AI agents. Your agent, having learned what you value, can make a reasonable decision to allow a micropayment for certain content and pay on your behalf without requiring a conscious decision each time, eliminating the mental transaction cost problem entirely. It's the mental transaction cost that arguably led to the failure of the micropayment model back in the early 2000s.
Although the cynical part of me says that this will result in malicious actors trying to trick agents into giving out a bunch of micropayments. There are counter-defenses that can help detect and compensate for that, but perhaps the best we will be able to do is prompt the user with the default agent recommendation.
We have a few dozen websites, from ones doing single-digit Mbit to a few Gbit.
Never needed it. Just put the worst offenders in a penalty bucket and that's usually enough.
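The comment above doesn't say how the penalty bucket is implemented, so the following is only one possible reading, sketched in Python: count recent requests per client in a sliding window, and once a client exceeds the limit, refuse it for a fixed penalty period. The class name `PenaltyBucket`, the default limits, and keying on a client string (e.g. an IP address) are all assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class PenaltyBucket:
    """Hypothetical sketch: sliding-window counter with a timed penalty for offenders."""

    def __init__(self, limit=100, window=10.0, penalty=300.0, clock=time.monotonic):
        self.limit = limit        # max requests per window (assumed value)
        self.window = window      # window length in seconds (assumed value)
        self.penalty = penalty    # seconds an offender stays blocked (assumed value)
        self.clock = clock        # injectable so the logic can be tested without sleeping
        self.hits = defaultdict(deque)  # client -> timestamps of recent requests
        self.penalised_until = {}       # client -> time when the penalty expires

    def allow(self, client: str) -> bool:
        now = self.clock()

        # Clients already in the penalty bucket are refused until the penalty expires.
        until = self.penalised_until.get(client)
        if until is not None:
            if now < until:
                return False
            del self.penalised_until[client]

        # Drop timestamps that have fallen out of the window, then record this request.
        recent = self.hits[client]
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)

        # Too many requests in the window: move the client into the penalty bucket.
        if len(recent) > self.limit:
            self.penalised_until[client] = now + self.penalty
            recent.clear()
            return False
        return True
```

In a real setup this would more likely live in the reverse proxy or firewall than in application code. Also, since it keys on the client address, it does nothing against the crawlers described earlier in the thread that send one request per residential IP.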