People outside of a really small sysadmin niche really don't grasp the scale of this problem.
I run a small-but-growing boutique hosting infrastructure for agency clients. The AI bot crawler problem recently got severe enough that I couldn't just ignore it anymore.
I'm stuck between, on one end, crawlers from companies that absolutely have the engineering talent and resources to do things right but still aren't, and on the other end, resource-heavy WordPress installations where the client was told it was a build-it-and-forget-it kind of thing. I can't police their robots.txt files; meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL), there are about 6 different pretty aggressive AI bots, and occasionally they'll get stuck on some site's product variants or categories pages and start hitting it at a 1r/s rate.
There's an invisible caching layer that does a pretty nice job with images and the like, so it's not really a bandwidth problem. The bots aren't even requesting images and other page resources very often; they're just doing tons and tons of page requests, and each of those is tying up a DB somewhere.
Cumulatively, it is close to having a site get Slashdotted every single day.
I finally started filtering out most bot and crawler traffic at nginx, before it gets passed off to a WP container. I spent a fair bit of time sampling traffic from logs, and at a rough guess, I'd say maybe 5% of web traffic is currently coming from actual humans. It's insane.
I've just wrapped up the first round of work for this problem, but that's just buying a little time. Now, I've gotta put together an IP intelligence system, because clearly these companies aren't gonna take "403" for an answer.
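For the curious, the nginx-side filtering mentioned above can start as something as small as a user-agent map in front of the WP upstream. A minimal sketch, assuming the bots identify themselves honestly; the agent list and upstream name are illustrative, not a recommendation:

    # Illustrative only: match a few crawler user agents and refuse them
    # before the request ever reaches the WP container. Goes in the http{} context.
    map $http_user_agent $ai_bot {
        default               0;
        ~*GPTBot              1;
        ~*ClaudeBot           1;
        ~*CCBot               1;
        ~*Bytespider          1;
        ~*Amazonbot           1;
        ~*meta-externalagent  1;
    }

    server {
        listen 80;
        server_name example.com;

        if ($ai_bot) {
            return 403;
        }

        location / {
            proxy_pass http://wp_backend;
        }
    }

It only catches the lazy crawlers that send an honest user agent, which is exactly why the IP intelligence step comes next.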
I might write a blog post on this, but I seriously believe we collectively need to rethink The Cathedral and the Bazaar.
The Cathedral won. Full stop. Everyone, more or less, is just a stonecutter, competing to sell the best stone (i.e. content, libraries, source code, tooling) for building the cathedrals with. If the world is a farmer's market, we're shocked that the farmer's market is not defeating Walmart, and never will.
People want Cathedrals; not Bazaars. Being a Bazaar vendor is a race to the bottom. This is not the Cathedral exploiting a "tragedy of the commons," it's intrinsic to decentralization as a whole. The Bazaar feeds the Cathedral, just as the farmers feed Walmart, just as independent websites feed Claude, a food chain and not an aberration.
The Cathedral and the Bazaar meets The Tragedy of the Commons.
Let's say there are two competing options in some market. One option is fully commercialized; the other holds to open-source ideals (whatever those are).
The commercial option attracts investors, because investors like money. The money attracts engineers, because at some point "hacker" came to mean "comfortable lifestyle in a high COL area". The commercial option gets all the resources, it gets a marketing team, and it captures 75% of the market because most people will happily pay a few dollars for something they don't have to understand.
The open source option attracts a few enthusiasts (maybe; or, often, just one), who labor at it in whatever spare time they can scrape together. Because it's free, other commercial entities use and rely on the open source thing, as long as it continues to be maintained under conditions that, if you squint, resemble slave labor. The open source option is always a bit harder to use, with fewer features, but it appeals to the 25% of the market that cares about things like privacy or ownership or self-determination.
So, one conclusion is "people want Cathedrals", but another conclusion could be that all of our society's incentives are aligned towards Cathedrals.
It would be insane, after all, to not pursue wealth just because of some personal ideals.
This is pretty much a more eloquent version of what I was about to write. It's dangerous to take a completely results oriented view of a situation where the commercial incentives are so absurdly lopsided. The cathedral owners spend more than the GDP of most countries every year on various carrots and sticks to maintain something like the current ecosystem. I think the current world is far from ideal for most people, but it's hard to compete against the coordinated efforts of the richest and most powerful entities in the world.
The answer is quite simply that where complexity exceeds the regular person's interest, there will be a cathedral.
It's not about capitalism or incentives. Humans have cognitive limits and technology is very low on the list for most. They want someone else to handle complexity so they can focus on their lives. Medieval guilds, religious hierarchies, tribal councils, your distribution's package repository, it's all cathedrals. Humans have always delegated complexity to trusted authorities.
The 25% who 'care about privacy or ownership' mostly just say they care. When actually faced with configuring their own email server or compiling their own kernel, 24% of that 25% immediately choose the cathedral. You know the type, the people who attend FOSDEM carrying MacBooks. The incentives don't create the demand for cathedrals, but respond to it. Even in a post-scarcity commune, someone would emerge to handle the complex stuff while everyone else gratefully lets them.
The bazaar doesn't lose because of capitalism. It loses because most humans, given the choice between understanding something complex or trusting someone else to handle it, will choose trust every time. Not just trust, but CYA (I'm not responsible for something I don't fully understand) every time. Why do you think AI is successful? I'd rather even trust a blathering robot than myself. It turns out, people like being told what to do on things they don't care about.
> The Bazaar feeds the Cathedral
Isn't this the licensing problem? Berkeley releases BSD so that everyone can use it, people do years of work to make it passable, Apple takes it to build macOS and iOS because the license allows them to, and then Apple has both the community's work and its own work, so everyone uses that.
The Linux kernel is GPLv2, not GPLv3, so vendors ship binary-blob drivers/firmware with their hardware, and the hardware becomes unusable as soon as they stop publishing new versions: to keep using it you're stuck on an old kernel with known security vulnerabilities. Or they lock the bootloader, because v2 lacks the anti-Tivoization clause that v3 has.
If you use a license that lets the cathedral close off the community's work then you lose, but what if you don't do that?
Couldn't it be addressed in front of the application with a fail2ban rule, some kind of 429 Too Many Requests quota on a per session basis? Or are the crawlers anonymizing themselves / coming from different IP addresses?
Yeah, that's where IP intelligence comes in. They're using pretty big IP pools, so, either you're manually adding individual IPs to a list all day (and updating that list as ASNs get continuously shuffled around), or you've got a process in the background that essentially does whois lookups (and caches them, so you aren't also being abusive), parses the metadata returned, and decides whether that request is "okay" or not.
The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% will require a lot more effort. You start encountering edge cases, like crawlers that use AWS for their crawling activity, but also one of your customers somewhere is syncing their WooCommerce orders to their in-house ERP system via a process that also runs on AWS.
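A rough sketch of what that background lookup-and-cache process could look like in Go, using RDAP (the JSON successor to whois) via the rdap.org bootstrap service; matching on the network "name" field and the hard-coded blocklist are simplifying assumptions:

    // Sketch of the "IP intelligence" idea: look up who owns an address via
    // RDAP, cache the answer so we aren't abusive ourselves, and flag requests
    // from networks we've decided to treat as crawler infrastructure.
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "strings"
        "sync"
    )

    type rdapNetwork struct {
        Name string `json:"name"` // the allocation / network name
    }

    var (
        mu    sync.Mutex
        cache = map[string]string{} // ip -> network name
    )

    func networkName(ip string) (string, error) {
        mu.Lock()
        if name, ok := cache[ip]; ok {
            mu.Unlock()
            return name, nil
        }
        mu.Unlock()

        resp, err := http.Get("https://rdap.org/ip/" + ip)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        var n rdapNetwork
        if err := json.NewDecoder(resp.Body).Decode(&n); err != nil {
            return "", err
        }
        mu.Lock()
        cache[ip] = n.Name
        mu.Unlock()
        return n.Name, nil
    }

    // Hand-maintained, purely illustrative list of network-name substrings.
    var blocked = []string{"ALIBABA", "TENCENT", "BYTEDANCE"}

    func looksLikeCrawlerIP(ip string) bool {
        name, err := networkName(ip)
        if err != nil {
            return false // fail open; a real system would retry and log
        }
        for _, b := range blocked {
            if strings.Contains(strings.ToUpper(name), b) {
                return true
            }
        }
        return false
    }

    func main() {
        fmt.Println(looksLikeCrawlerIP("203.0.113.10"))
    }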
I've had crawlers get stuck in a loop before on a search page where you could basically just keep adding filters, even if there are no results. I filtered requests that are bots for sure (requests whose parameters went long past the point of returning any results). It was over a million unique IPs, most of which were only doing 1 or 2 requests each (from many different IP blocks).
> meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL)
Can't these responses still be cached by a reverse proxy as long as the user isn't logged in, which the bots presumably aren't?
They're presumably not crawling the same page repeatedly, and caching the pages long enough to persist between crawls would require careful thinking and consultation with clients (e.g. if they want their blog posts to show up quickly, or an "on sale" banner or etc).
It'd probably be easier to come at it from the other side and throw more resources at the DB or clean it up. I can't imagine what's going on that it's spending a full second on DB queries, but I also don't really use WP.
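For reference, the kind of logged-out microcache the quoted question has in mind usually looks something like this in nginx; the cookie names, zone sizes, and the 5-minute TTL are assumptions to negotiate with clients, not a drop-in config:

    # Sketch of a logged-out "microcache" in front of WP; short TTL limits staleness.
    proxy_cache_path /var/cache/nginx/wp levels=1:2 keys_zone=wpcache:50m
                     max_size=1g inactive=10m;

    map $http_cookie $skip_cache {
        default                      0;
        ~*wordpress_logged_in        1;   # never cache logged-in users
        ~*woocommerce_items_in_cart  1;   # or carts
    }

    server {
        location / {
            proxy_cache            wpcache;
            proxy_cache_key        "$scheme$host$request_uri";
            proxy_cache_valid      200 301 5m;
            proxy_cache_bypass     $skip_cache;
            proxy_no_cache         $skip_cache;
            proxy_cache_use_stale  error timeout updating;
            proxy_pass             http://wp_backend;
        }
    }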
It's been a few years since I last worked with WP, but the performance issue is because they store a ton of the data in a key-value store instead of tables with fixed columns.
This can result in a ton of individual row hits on your database, for what in any normal system would be a single 0.1ms (often faster) DB request.
Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper. It's just that WP is in general ** for performance.
If you want to see what a bad scraper does, with parallel requests and few limits, yeah, WP goes down without putting up any struggle. But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
Is that WP core or a result of plugins? Only if you know offhand; I don't need to know badly enough for it to be worth digging into.
> Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper.
I think there's still room for improvement there, but I get what you mean. I think an "ideal" bot would base its QPS on response time and back off if it goes up, but it's also not unreasonable to say "any website should be able to handle 1 QPS without flopping over".
> It's just that WP is in general ** for performance.
WP gets a lot of hate, and much of it is deserved, but I genuinely don't think I could do much better under the constraint of supporting an often non-technical userbase with a plugin system that can do basically arbitrary things, built by developers of varying quality.
> But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
This is actually an interesting question; I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer, because I doubt either party cares about temporally sensitive content (like flash sales).
A combination of all of them ... Take into account that it's been 8 years since I last worked in PHP and WordPress, so maybe things have improved, but I doubt it, as some of the issues are structural.
* PHP is a fire-and-forget programming language. Whenever you do a request, there is no persistence of data (unless you offload to an external cache server). This results in a total re-render of the PHP code on every request.
* Then we have WP core, which is not exactly shy in its calls to the DB. The way they store data in a key/value system really hurts the performance. And remember what I said above about PHP: if you have a design that is heavy, your language needs to redo all those calls on every request.
* Followed by ... extensions that are, let's just say, not always optimally written. The plugins are often the main reason why you see so many leaked databases on the internet.
The issue with WP is that its design is like 25 years old. It gained most of its popularity because it was free and you were able to extend it with plugins. But it's that same plugin system that made it harder for the WP developers to really tackle the performance issues, as breaking a ton of plugins often results in losing market share.
The main reason WP has survived the increase in web traffic is that PHP has gotten roughly 3x faster over the years, combined with server hardware itself getting faster and faster. It also helped that cache plugins exist for WP.
But now, as you have noticed, when you have a ton of passive or aggressive scrapers hitting WP websites, the cache plugins that have been the main protection layer keeping WP sites functional cannot handle it. Scrapers hit every page, even pages that are unpopular/archived/... and normally never get cached. Because you're getting hit on those unpopular pages, the fundamental weakness of WP shows.
The only way you can even partially deal with this type of behavior (beyond just blocking scrapers) is by increasing your database memory limits by a ton, so you're not doing constant swapping; increasing the page caching in your actual WP cache extensions, so more is held in memory; and probably also increasing the number of PHP instances your server can load, more DB ...
But that assumes you have control over your WP hosting environment. And the companies that host 100,000s or millions of sites are not exactly motivated to throw tons of money at the problem. They prefer that you "upgrade" to more expensive packages that will only partially mitigate the issue.
In general, everybody is f___ed ... The amount of data scraping is only going to get worse.
Especially now that LLMs have tool usage, as in they can search the internet for information themselves. This is going to result in tens of millions of requests from LLMs. Somebody searching for a cookie recipe may result in dozens of page hits in a second, where a normal user in the past first did a Google search (hitting Google's cache), only then opened a page, decided it's not what they wanted, went back, tried somewhere else. What used to be 10 requests over multiple sites in a 5-10 minute time frame is now going to be dozens of parallel requests per second.
LLMs are great search engines, but as the tech moves to more consumer-level hardware, you're going to see this only getting worse.
The solution is a fundamental rework of a lot of websites. One of the main reasons I switched away from PHP years ago, and eventually settled on Go, was that even at that time we were hitting limits already. It's one of the reasons Facebook made Hack (PHP with persistence and other optimizations). The days when you re-render complete pages on every request, you're just giving away performance; the days when you cannot cache data in-process, ... you get the point.
> This is actually an interesting question; I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer, because I doubt either party cares about temporally sensitive content (like flash sales).
The issue is not cached content; it's that they go for all the data in your database. They do not care if your articles are from 1999.
The only way you can solve this issue is by having API endpoints for every website, where scrapers can feed on your database data directly (so you avoid needing to render complete pages), AND where they can feed on /api/articles/latest-changed or something like that.
And that assumes this is standardized across the industry. Because if it's not, it's just easier for scrapers to go after all the pages.
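Hypothetically, a latest-changed feed like the one suggested above could be tiny. A Go sketch, where the route, JSON fields, and in-memory article list are placeholders rather than any actual standard:

    // Sketch of a hypothetical /api/articles/latest-changed endpoint, so a
    // scraper can poll one cheap JSON feed instead of re-rendering every page.
    // The Article struct and in-memory slice stand in for a real DB query.
    package main

    import (
        "encoding/json"
        "net/http"
        "sort"
        "time"
    )

    type Article struct {
        ID        int       `json:"id"`
        URL       string    `json:"url"`
        UpdatedAt time.Time `json:"updated_at"`
    }

    var articles []Article // in a real site this would come from the database

    func latestChanged(w http.ResponseWriter, r *http.Request) {
        // Optional ?since=RFC3339 filter so clients only fetch what's new.
        since := time.Time{}
        if s := r.URL.Query().Get("since"); s != "" {
            if t, err := time.Parse(time.RFC3339, s); err == nil {
                since = t
            }
        }
        out := []Article{}
        for _, a := range articles {
            if a.UpdatedAt.After(since) {
                out = append(out, a)
            }
        }
        sort.Slice(out, func(i, j int) bool { return out[i].UpdatedAt.After(out[j].UpdatedAt) })
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(out)
    }

    func main() {
        http.HandleFunc("/api/articles/latest-changed", latestChanged)
        http.ListenAndServe(":8080", nil)
    }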
FYI: I wrote my own scraper in Go. On a dual-core VPS that costs 3 euros a month, it can do 10,000 scrapes per second (we are talking direct scrapes, not going through a browser to deal with JS detection).
Now, do you want to guess the resource usage on your WP server if I let it run wild? ;) You're probably going to spend 10 to 50x more money just to feed my scraper without me taking your website down.
Now, do I do 10,000 requests per second? No ... because 1r/s per website is still 86,400 page hits per day, and because I combined this with actually looking for websites' "latest xxxx" pages and caching that content, I knew that I only needed to scrape X new pages every 24h. So the initial scrape of some big websites took me a month or three, and after that you do not even see me, as I am only picking up page updates.
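That polite loop boils down to something like the following sketch: fixed 1r/s pacing plus conditional requests, so unchanged pages cost almost nothing; the URL list and in-memory bookkeeping are placeholders:

    // Sketch of a "polite" scraping loop: one request per second per site,
    // with conditional requests so unchanged pages are not re-downloaded.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        urls := []string{"https://example.com/a", "https://example.com/b"}
        lastModified := map[string]string{} // url -> Last-Modified we saw last time

        client := &http.Client{Timeout: 15 * time.Second}
        ticker := time.NewTicker(time.Second) // hard cap: 1 request per second
        defer ticker.Stop()

        for _, u := range urls {
            <-ticker.C
            req, _ := http.NewRequest(http.MethodGet, u, nil)
            req.Header.Set("User-Agent", "example-polite-scraper/0.1")
            if lm, ok := lastModified[u]; ok {
                req.Header.Set("If-Modified-Since", lm) // let the server say "nothing new"
            }
            resp, err := client.Do(req)
            if err != nil {
                continue
            }
            if resp.StatusCode == http.StatusNotModified {
                resp.Body.Close()
                continue // unchanged since last visit, skip re-downloading
            }
            if lm := resp.Header.Get("Last-Modified"); lm != "" {
                lastModified[u] = lm
            }
            // ... parse and store resp.Body here ...
            resp.Body.Close()
            fmt.Println("fetched", u, resp.StatusCode)
        }
    }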
But that takes work! You need to design this for every website, and some websites do not have any good spot you can hook into for a low-resource "is there something new" check.
And I'm not even talking about websites that actively try to make scraping difficult (constantly changing tags, dynamic HTML blocks on every render, JS blocking, forced captchas), which ironically hurts them more, as it can result in full rescrapes of their sites.
So, ironically, the easiest solution for less scrupulous scrapers is to simply throw resources at the issue. Why bother with "is there something new" work for every website when you can just rescrape every page link you find with a dumb scraper, compare it against your local cache checksum, and then update your scraped page result? And that's how you get those over-aggressive scrapers that DDoS websites. Combine that with half of the internet being WP websites... lol.
The amount of resources needed to scrape is so small, and the more you try to prevent scrapers, the more you're going to hinder your own customers / legit users.
And again, this is just me doing scraping of some novel/manga websites for my own private usage / datahoarding. The big boys have access to complete IP blocks, can resort to using home IPs (as some sites detect whether you're coming from a leased datacenter IP or a home ISP IP), and have way more resources available to them.
This has been way too long, but the only way to win against scrapers is a standardized way to do legit scraping. Ironically, we used to have this with RSS feeds years ago, but everybody gave up on them. When there is an easier endpoint for scrapers, a lot of them have less incentive to just scrape your every page. Will there be bad guys? Yep, but then it becomes easier to target just them until they also comply.
But the internet will need to change into something new for it to survive the new era ... and I think standardized API endpoints will be that change. Or everybody needs to go behind login pages, but yeah, good luck with that, because even those are very easy to bypass with account-creation services.
Yeah, everybody is going to be f___ed, because the small website can forget about making money with advertising. The revenue model is also going to change. We already see this with Reddit selling their data directly to Google.
> The way they store data in a key/value system really hurts the performance
It doesn't, unless your site has a lot of post/product/whatever entries in the DB and you have your users search among them with multiple criteria at the same time. Only then does it cause many self-joins and create performance concerns. Otherwise the key-value setup is very fast when it comes to just pulling key+value pairs for a given post/content.
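For context, a multi-criteria meta filter over that key/value table takes roughly this shape, with one extra self-join per criterion; a simplified sketch, not the exact SQL WordPress emits:

    -- Simplified sketch of a multi-criteria meta query: every additional
    -- filter means another self-join against the key/value table.
    SELECT p.ID, p.post_title
    FROM wp_posts p
    INNER JOIN wp_postmeta color ON color.post_id = p.ID
        AND color.meta_key = 'color' AND color.meta_value = 'blue'
    INNER JOIN wp_postmeta size  ON size.post_id  = p.ID
        AND size.meta_key  = 'size'  AND size.meta_value = 'xl'
    INNER JOIN wp_postmeta price ON price.post_id = p.ID
        AND price.meta_key = 'price' AND CAST(price.meta_value AS DECIMAL(10,2)) < 20
    WHERE p.post_type = 'product' AND p.post_status = 'publish';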
Today WordPress can easily do 50 req/sec cached (locally) on $5/month hosting with PHP 8+. It can easily do 10 req/sec uncached for logged-in users, with absolutely no form of caching (though you would generally use an object cache, pushing it much higher).
The White House is on WordPress. NASA is on WordPress. TechCrunch, CNN, Reuters and a lot more.
Just want to point out that your 50 req/sec cached means nothing when dealing with scrapers, which is the entire topic ...
The issue is that scrapers hit so many pages that you can never cache everything.
If your website is a 5-page blog with no built-up archive of past posts, sure... Scrapers are not going to hurt, because they keep hitting the cached pages and resetting the invalidation.
But for everybody else, getting hit on uncached pages results in heavy DB load and kills your performance.
Scrapers do not care about your top (cached) pages, especially aggressive ones that just rescrape non-stop.
> It doesn't, unless your site has a lot of post/product/whatever entries in the DB
Which is exactly what is being hit by scrapers...
> The White House is on WordPress. NASA is on WordPress. TechCrunch, CNN, Reuters and a lot more.
Again, not the point. They can throw resources at the problem and cache tons of data with 512GB/1TB WordPress/DB servers, which effectively turns WP into a mostly static site.
It's everybody else that feels the burn (see the article, the previous poster, and others).
Do you understand the issue now? WP is not equipped to deal with this type of traffic, as it's not normal human traffic. WP is not designed to handle this; it barely handles normal traffic without a lot of resources thrown at it.
There is a reason the Reddit/Slashdot effect exists. Just a few thousand people going to a blog tends to make a lot of WP websites unresponsive. And that is with the ability to cache those pages!
Now imagine somebody like me letting a scraper loose on your WP website. I can scrape 10,000 pages/sec on a 4-buck VPS, but each page I hit that is not in your cache will make your DB scream even more, because of how WP works. So what are you going to do with your 50 req/s cached, when my next 9,950 req/s hit all your non-cached pages?! You get the point?
And FYI: 10,000 r/s on your cached pages will also make your WP install unresponsive. Scraper resource usage vs. WP is a fight nobody wins.
That would be nice! This doesn't work reliably enough for WP sites. Whether it's devs making changes and testing them in prod, or dynamic content loaded in identical URLs, my past attempts to cache html have caused questions and complaints. The current caching strategy hits a nice balance and hasn't bothered anyone, with the significant downside that it's vulnerable to bot traffic.
(If you choose to read this as, "WordPress is awful, don't use WordPress", I won't argue with you.)
This is probably a dumb question, but at what point do we put a simple CAPTCHA in front of every new user that arrives at a site, then give them a cookie and start tracking requests per second from that user?
I guess it's a kind of soft login required for every session?
update: you could bake it into the cookie approval dialog (joke!)
The post-AI web is already a huge mess. I'd prefer solutions that don't make it worse.
I myself browse with cookies off, sort of, most of the time, and the number of times per day that I have to click a Cloudflare checkbox or help Google classify objects from its datasets is nuts.
My cousin manages dozens of mid-sized informational websites and communities; his former hosting provider kicked him out because he refused to pay the insane bills that resulted from AI bots literally DDoS-ing his sites...
He unfortunately had no choice but to put most of the content behind a login wall (you can only see parts of the articles/forum posts when logged out), and he is strongly considering just hard-paywalling some content at this point... We're talking about someone who in good faith provided partial data dumps of the freely available content for these companies to download. But caching / ETags? None of these AI companies, hiring "the best and the brightest", have ever heard of that. Rate limiting? LOL, what is that?
This is nuts, these AI companies are ruining the web.
I'm not sure why they don't just cache the websites and avoid going back for at least 24 hours, especially in the case of most sites. I swear it's like we're re-learning software engineering basics with LLMs / AI and it kills me.
It's worth noting that search engines back then (and now? except the AI ones) generally tended to follow robots.txt, which meant that if there were heavy areas of your site that you didn't want them to index you could filter them out and let them just follow static pages. You could block off all of /cgi-bin/ for example and then they would never be hitting your CGI scripts - useful if your guestbook software wrote out static files to be served, for example.
The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and in return it also saved the websites too. Even with a hundred crawlers actively indexing your site, they weren't going to index it more than, say, once a day, and 100 requests in a day isn't really that much even back then.
Now, companies are pumping billions of dollars into AI; budgets are infinite, limits are bypassed, and norms are ignored. If the company thinks it can benefit from indexing your site 30 times a minute then it will, but even if it doesn't benefit from it there's no reason for them to stop it from doing so because it doesn't cost them anything. They cannot risk being anything other than up-to-date, because if users are coming to you asking about current events and why space force is moving to Alabama and your AI doesn't know but someone else's does, then you're behind the times.
So in the interests of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way shape or form - they may as well scrape every URL on your site once per second, because it doesn't cost them anything and they don't care if you go bankrupt and shut down.
This! Today I asked Claude Sonnet to read the Wikipedia article on “inference” and answer a few of my questions.
Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Still no cogent answer. Pathetic. Very much an Anthropic blindspot—to the point of being at least amoral and even immoral.
Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?
Dario and Sam et al.: Contribute to the welfare of your own blood donors.
> Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Even worse when you consider that you can download all of Wikipedia for offline use...
I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.
It's because they don't give a shit whether the product works properly or not. By blocking AI scraping, sites are forcing AI companies to scrape faster before they're blocked. And faster means sloppier.
Slow march? It feels like we've been on that train a while, honestly. It's embarrassing. We don't even have fully native GUIs; they're all browser wrappers.
IMO, when it kills somebody, it justifies extreme measures such as feeding them fabricated truths, like LLM-generated and artificially corrupted text /s
I'll add my voice to others here that this is a huge problem especially for small hobbyist websites.
I help administer a somewhat popular railroading forum. We've had some of these AI crawlers hammering the site to the point that it became unusable to actual human beings. You design your architecture around certain assumptions, and one of those was definitely not "traffic quintuples."
We've ended up blocking lots of them, but it's a neverending game of whack-a-mole.
> one of those was definitely not "traffic quintuples."
Oh, it was... People warned about the mass adoption of WordPress because of its performance issues.
Internet usage kept growing, even without mass LLM scraping. Everybody wants more and more up-to-date info, recent price checks, and so many other features. This trend has been going on for 10+ years.
It's just that now, bot scraping for LLMs has pushed some sites over the edge.
> We've ended up blocking lots of them, but it's a neverending game of whack-a-mole.
And unless you block every IP, you cannot stop them. It's really easy to hide scrapers, especially if you use a slow scrape rate.
The issue comes when you have, like one of the posters here, a setup where a DB call takes up to 1s for some product pages that are not in cache. Those sites were already living on borrowed time.
Ironically, better software on their site (like not using WP) would let them easily handle 1000x the volume with the same resources. And don't get me started on how badly configured a lot of sites are on the backend.
People are kind of blaming the wrong issue. Our need for up-to-date data has been growing over the last 10 years. It's just that people considered a website that takes 400ms to generate a page to be OK (when in reality it is wasting tons of resources or is limited in the backend).
This is something I have a hard time understanding. What is the point of this aggressive crawling? Gathering training data? Don't we already have massive repos of scraped web data being used for search indexing? Is this a coordination issue, each little AI startup having to scrape its own data because nobody is willing to share their stuff as regular dumps? For Wikipedia we have the official offline downloads, for books we have books3, but there's not an equivalent for the rest of the web? Could this be solved by some system where website operators submit text copies of their sites to a big database? Then in robots.txt or similar add a line that points to that database with a deep link to their site's mirrored content?
The obvious issues are: a) who would pay to host that database. b) Sites not participating because they don't want their content accessible by LLMs for training (so scraping will still provide an advantage over using the database). c) The people implementing these scrapers are unscrupulous and just won't bother respecting sites that direct them to an existing dumped version of their content. d) Strong opponents to AI will try poisoning the database with fake submissions...
Or does this proposed database basically already exist between Cloudflare and the Internet Archive, and we already know that the scrapers are some combination of dumb and belligerent and refuse to use anything but the live site?
I asked Google AI Mode “does Google ai mode make tens of site requests for a single prompt” and it showed “Looking at 69 sites” before giving a response about query fan-out.
Cloudflare has some large part of the web cached, IA takes too long to respond and couldn’t handle the load. Google/OpenAI and co could cache these pages but apparently don’t do it aggressively enough or at all
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
Indeed. My understanding is that crawl is a real expense at scale so they optimize for "just enough" to catch most site update rhythms and then use other signals (like blog pings, or someone searching for a URL that's not yet crawled, etc) to selectively chase fresher content.
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW No one uses blog pings, the latest hotness is IndexNow.
I suspect they simply do not care. The owners of these companies are exactly the sort of people who are genuinely puzzled and offended when someone wants them to think about anything but themselves.
The attitude is visible in everything around AI, why would crawling be different?
Web scrapers earned their bad rep all on their own, thank you very much. This is nothing new. Scrapers have no concern for whether a site is mostly static with stale text or constantly updated. Most sites are not FB/Twitt...er, X/etc. Even retail sites that aren't Amazon don't have new products being listed every minute. But accounting for that would require someone on the scraper's side to pay attention; instead they just let the computer run, even if it is reading the same data every time.
Even if sites offered their content in a single downloadable file for bots, the bot creators would not trust that it isn't stale and out of date, so they'd still continue to scrape, ignoring the easy method.
I created and maintain ProtonDB, a popular Linux gaming resource. I don't do ads, just pay the bills from some Patreon donations.
It's a statically generated React site I deploy on Netlify. About ten days ago I started incurring 30GB of data per day from user agents indicating they're using Prerender. At this pace, it will push me past the 1TB allotted for my plan, so I'm looking at an extra ~$500 USD a month for the extra bandwidth boosters.
I'm gonna try the robots.txt options, but I'm doubtful this will be effective in the long run. Many other options aren't available if I want to continue using a SaaS like Netlify.
My initial thoughts are to either move to Cloudflare Pages/Workers where bandwidth is unlimited, or make an edge function that parses the user agent and hope it's effective enough. That'd be about $60 in edge function invocations.
I've got so many better things to do than play whack-a-mole on user agents and, when failing, pay this scraping ransom.
Can I just say fuck all y'all AI harvesters? This is a popular free service that helps get people off of their Microsoft dependency and live their lives on a libre operating system. You wanna leech on that? Fine, download the data dumps I already offer on an ODbL license instead of making me wonder why I fucking bother.
$500 for exceeding 1TB? The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan. Pick your favourite $5/month VPS platform - I suggest Hetzner with its 20TB limit (if their KYC process lets you in) or Digital Ocean if not (with only 1TB but overage is only a few bucks extra). Even freaking AWS, known for extremely high prices, is cheaper than that (but still too expensive so don't use it).
> The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan.
No, it's both.
The crawlers are lazy, apparently have no caching, and there is no immediately obvious way to instruct/force those crawlers to grab pages in a bandwidth-efficient manner. That being said, I would not be surprised if someone here will smugly contradict me with instructions on how to do just that.
In the near term, if I were hosting such a site I'd be looking into slimming down every byte I could manage, using fingerprinting to serve slim pages to the bots and exploring alternative hosting/CDN options.
One of the worst takes I've seen. Yes, that's expensive, but the individuals doing insane amounts of unnecessary scraping are the problem. Let's not act like this isn't the case.
To clarify the math: Netlify bills $50 for each 100GB over the Pro plan limit of 1TB, which is the barrel I'm looking down just this month, before others get the same idea. So yes, I'm squeezed on both sides unless I put the work in to rehost.
I went to a Subway shop that charged $50 per lettuce strip past the first 20. As the worker sprinkled lettuce on my sandwich, I counted anxiously, biting my nails. 19, phew, I'm safe. I think I'll come back here tomorrow.
Tomorrow, someone in front of me asked for extra lettuce. The worker got confused and put it on my sandwich. I was charged $1000. Drat.
> The worker got confused and put it on my sandwich.
No, this is where you're completely and totally incorrect. There is no 'worker accidentally making a human mistake that costs you money' here. This is a 'multi-billion dollar company routinely runs scripts that they KNOW cost you money, but do it anyways because it generates profit for them'. To fix your example,
You RUN a Subway that sells sandwiches. Your lettuce provider charges you $1 per piece of lettuce. Your average customer is given $1 worth of lettuce in their sub. One customer keeps coming in, reaching over the counter, and grabbing handfuls of lettuce. You cannot ban this customer because they routinely put on disguises and ignore your signs saying 'NO EXTRA LETTUCE'. Eventually this bankrupts you, forces you to stop serving lettuce in your subs entirely, or you have to put up bars (eg, Cloudflare) over your lettuce bins.
I'm not sure what Netlify is doing, but the heaviest assets on your website are your javascript sources. Have you considered hosting those on GitHub pages, which has a generous free tier?
The images are from steamcdn-a.akamaihd.net, which I assume is already being hosted by a third-party (Steam)
Do you have the ability to block ASNs? I help sysadmin a DIY building forum, and we cut 80% of the load from our server by blocking all Alibaba IPs in ASN 45102. Singapore was sending the most bot traffic.
Your mistake is openly suggesting on HN that you're going to use Cloudflare, increasing the centralization of the internet and contributing to their attestation schemes, while society forces you to be a victim of the tragedy of the commons.
Another option that wouldn't contribute to more centralization might be neocities. They give you 3 TB for $5/month. That seems to be _the_ limit though. The dude runs his own CDN just for neocities, so it's not just reselling cloudflare or something.
P.S. Thank you for ProtonDB, it has been so incredibly helpful for getting some older games running.
You don't need to apologize - HN needs to get their heads out of the sand that not everything is a tragedy of the commons, there's a reason why centralization exists, and the decentralized internet as it is now comes with serious drawbacks. We're never going to overcome the popularity of big tech if we can't be honest with the problems they solve.
Also, sue me, the cathedral has defeated the bazaar. This was predictable, as the bazaar is a bunch of stonecutters competing with each other to sell the best stone for building the cathedral with. We reinvented the farmer's market, and thought that if all the farmers united, they could take down Walmart. It's never happening.
In this context, the farmers are trying to deal with rampant abuse that is inconceivable to handle on an individual level.
It's not clear to me what taking down Cloudflare/Walmart means in this context. Nor how banding together wouldn't just incur the very centralization that is presumably so bad it must be taken down.
Webmasters are really kinda stuck between a rock and a hard place with this one.
At least with what I'm doing, poorly configured or outright malicious bots consume about 5000x the resources that human visitors do, so having no bot mitigation means I've basically given up and decided I should try to make it as a vegetable farmer instead of doing stuff online.
Bot mitigation in practice is a tradeoff between what's enough of an obstacle to keep most of the bots out, while at the same time not annoying the users so much they leave.
I think right now Anubis is one of the less bad options. Some users are annoyed by it (and it is annoying), but it's less annoying than clicking fire hydrants 35 times, and as long as you configure it right it seems to keep most of the bots out, or at least drives them to behave in a more identifiable manner.
Probably won't last forever, but I don't know what would, besides going full ancap special needs kid and doing crypto microtransactions for each page request. That would unfortunately drive off not only the bots, but the human visitors as well.
Anubis is extremely slow on low-end devices, it often takes >30 seconds to complete. Users deserve better, but I guess it's still a better experience than reCaptcha or Cloudflare.
The ironic part ... LLMs are very good at solving CAPTCHAs. So the only people bothered by those same CAPTCHAs are the actual site visitors.
What sites need to do is temporarily block repeat requests from the same IPs. Sure, some agents use tens of thousands of IPs, but if they are really as aggressive as people state, you're going to run into the same IPs way more often than with normal users.
That will kick out the over-aggressive guys. I have done web scraping and limited it to around 1r/s. You never run into any blocking or detection that way, because you hardly show up. But then you have some *** that sends thousands of parallel requests at a website, because they never figured out query builders for large page hits and don't know how to build checks against last-updated pages.
One of the main issues I see is that some people simply write the most basic of basic scrapers. See link, follow, spawn process, scrape, see 100 more links ... Updates? Just rescrape the website, repeat, repeat... Because it takes time to make a scrape template for each website that knows where to check for updates, some never bother.
And because companies like Fastly only measure things via javascript execution and assume everything that doesn't execute JS correctly is a bot, that 80% contains a whole bunch of human persons.
The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:
> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.
...
> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.
And also there's one final number that isn't in the Fastly report but is in the EL Reg article[2]:
> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
It's really bad for anyone using anything other than Chrome to browse the web, or any accessibility tools or privacy software, because a bunch of sites will now block you, assuming you're a web crawler.
This has been widely reported for months now. Anthropic just reported another $13B in funding. Clearly, the companies just do not care to invest any effort to improving their behavior.
Can I ask a stupid question? Why is this so much worse than what they were doing to gather articles for traditional search engines? I assume that they are gathering pretty much the same data? It is the same articles, no?
— I just realized these are callouts from the LLM on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.
No, the traffic is not caused by client requests (like when your ChatGPT session does a search and checks some sources). It is caused by training runs. The difference is that AI companies are not storing the data they scrape. They let the model ingest the data, then throw it away. When they train the next model, they scrape the entire Internet again. At least that's how I understand it.
People who didn't respect basic ethics, copyright law, and common sense aren't gonna stop just because they're a nuisance. They'll keep at it until they've ruined what birthed them so they may replace it. Fuck AI.
Why don't sites just start publishing a dump of their site that crawlers could pull instead? I realize that won't work for dynamic content, but surely a lot of these "small" sites that are out there which are currently getting hammered, are not purely dynamic content?
Maybe we could just publish a dump, in a standard format (WARC?), at a well-known address, and have the crawlers check there? The content could be regularly updated, and use an ETag/etc. so that crawlers know when it's been updated.
I suspect that even some dynamic sites could essentially snapshot themselves periodically, maybe once every few hours, and put it up for download to satiate these crawlers while keeping the bulk of the serving capacity for actual humans.
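On the crawler side, honoring such a dump could be a single conditional request. A Go sketch, where the /.well-known/site-dump.warc.gz path is a made-up convention used purely for illustration:

    // Crawler-side sketch of the proposal above: check a hypothetical
    // well-known dump URL with a conditional request, and only download
    // the archive when its ETag has changed since the last visit.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func fetchDump(site, lastETag string) (newETag string, updated bool, err error) {
        req, err := http.NewRequest(http.MethodGet, site+"/.well-known/site-dump.warc.gz", nil)
        if err != nil {
            return "", false, err
        }
        if lastETag != "" {
            req.Header.Set("If-None-Match", lastETag) // "only send it if it changed"
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return "", false, err
        }
        defer resp.Body.Close()

        if resp.StatusCode == http.StatusNotModified {
            return lastETag, false, nil // dump unchanged, nothing to do
        }
        out, err := os.Create("site-dump.warc.gz")
        if err != nil {
            return "", false, err
        }
        defer out.Close()
        if _, err := io.Copy(out, resp.Body); err != nil {
            return "", false, err
        }
        return resp.Header.Get("ETag"), true, nil
    }

    func main() {
        etag, updated, err := fetchDump("https://example.com", "")
        fmt.Println(etag, updated, err)
    }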
Because crawlers aren't concerned about the bandwidth of the sites they crawl and will simply continue to take everything, everywhere, all the time regardless of what sites do.
Also it's unfair to expect every small site to put in the time and effort to, in essence, pay the Danegeld to AI companies just for the privilege of their continued existence. It shouldn't be the case that the web only exists to feed AI, or that everyone must design their sites around feeding AI.
"It used to be when search indexing crawler, Googlebot, came calling, I could always hope that some story on my site would land on the magical first page of someone's search results so they'd visit me, they'd read the story, and two or three times out of a hundred visits, they'd click on an ad, and I'd get a few pennies of income."
Can we get an AI-powered tarpit for these crawlers?
Add a hidden link and disallow it in robots.txt, so well-behaved crawlers never touch it.
When a crawler hits that link anyway, a light-on-resources language model produces infinite amounts of plausible-looking gibberish for it to crawl, with links and everything.
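A toy version of that tarpit, substituting a cheap deterministic word-salad generator for the language model; the path, word list, and link count are arbitrary:

    // Sketch of a crawler tarpit endpoint: every hit returns a cheap page of
    // word salad plus more links back into the tarpit, so a misbehaving bot
    // can wander forever. A real deployment would also rate-limit it and
    // disallow the path in robots.txt.
    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
        "net/http"
    )

    var words = []string{"cathedral", "bazaar", "crawler", "lettuce", "cache",
        "postmeta", "anubis", "scraper", "walmart", "wordpress"}

    func tarpit(w http.ResponseWriter, r *http.Request) {
        // Seed from the URL so the same path always yields the same gibberish.
        h := fnv.New64a()
        h.Write([]byte(r.URL.Path))
        rng := rand.New(rand.NewSource(int64(h.Sum64())))

        w.Header().Set("Content-Type", "text/html")
        fmt.Fprint(w, "<html><body><p>")
        for i := 0; i < 200; i++ {
            fmt.Fprint(w, words[rng.Intn(len(words))], " ")
        }
        fmt.Fprint(w, "</p>")
        for i := 0; i < 10; i++ {
            fmt.Fprintf(w, `<a href="/trap/%d">more</a> `, rng.Int63())
        }
        fmt.Fprint(w, "</body></html>")
    }

    func main() {
        http.HandleFunc("/trap/", tarpit)
        http.ListenAndServe(":8080", nil)
    }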
Is this data being collected for training sets? That seems problematic. I can't be the only one who's noticed that the web is quickly filling up with AI generated clickbait (which has made using a search engine more difficult).
I just block them by User Agent string[1]. The rest that fake the UA get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.
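The rate-limiting half of that setup might look something like this in nginx; the 2r/s rate, burst size, and zone size are placeholders to tune against real traffic:

    # Sketch of per-IP rate limiting at the web server.
    limit_req_zone $binary_remote_addr zone=perip:20m rate=2r/s;

    server {
        location / {
            limit_req zone=perip burst=10 nodelay;
            limit_req_status 429;
            proxy_pass http://backend;
        }
    }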
There is a very large-scale crawler that uses random valid user agents and a staggeringly large pool of IPs. I first noticed it because a lot of traffic was coming from Brazil and "HostRoyale" (ASN 203020). They send only a few requests a day from each IP, so rate limiting is not useful.
I run a honeypot that generates URLs tagged with the source IP, so I am pretty confident it is all one bot; in the past 48 hours I have had over 200,000 IPs hit the honeypot.
I am pretty sure this is ByteDance; they occasionally hit these tagged honeypot URLs with their normal user agent from their usual .sg datacenter.
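One way to implement that kind of tagging is sketched below; signing the tag with an HMAC is an extra assumption (so tags can't be forged), and the paths and key are placeholders:

    // Sketch of honeypot URL tagging: links embed the visitor's IP plus an
    // HMAC, so when the same tag later shows up from a different address you
    // know the crawl was shared across an IP pool.
    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "net"
        "net/http"
    )

    var key = []byte("replace-with-a-real-secret")

    func tag(ip string) string {
        mac := hmac.New(sha256.New, key)
        mac.Write([]byte(ip))
        return ip + "-" + hex.EncodeToString(mac.Sum(nil))[:16]
    }

    func page(w http.ResponseWriter, r *http.Request) {
        ip, _, _ := net.SplitHostPort(r.RemoteAddr)
        // Emit a hidden link carrying the tag of whoever fetched this page.
        fmt.Fprintf(w, `<a href="/honeypot/%s" style="display:none">.</a>`, tag(ip))
    }

    func honeypot(w http.ResponseWriter, r *http.Request) {
        ip, _, _ := net.SplitHostPort(r.RemoteAddr)
        // Log which IP originally received this link vs. which IP followed it.
        fmt.Printf("honeypot hit: tag=%s followed-by=%s\n", r.URL.Path, ip)
        http.NotFound(w, r)
    }

    func main() {
        http.HandleFunc("/", page)
        http.HandleFunc("/honeypot/", honeypot)
        http.ListenAndServe(":8080", nil)
    }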
I wonder if you could implement a dummy rate limit? Half the time you are rate limited randomly. A real user will think nothing of it and refresh the page.
If they are a real user going on your site in 2025 then they have no alternative they are even interested in. They will blame their ISP and wait.
Meanwhile rate limiting the llm could potentially cost a lot of money in time and compute to people who don’t have our best interests at heart. Seems like a win to me.
I've written my own bots that do exactly this. My reason was mainly to avoid detection so as part of that I also severely throttled my requests and hit the target at random intervals. In other words, I wasn't trying to abuse them. I just didn't want them to notice me.
TLDR it's trivial to send fake info when you're the one who controls the info.
Because it's a bad solution. The core problem is that the internet is vulnerable to DDoS attacks and the web has no native sybil resistance mechanism.
Cloudflare's solution to every problem is to allow them to control more of the internet. What happens when they have enough control to do whatever they want? They could charge any price they want.
I'm more afraid of the orgs that are gaining enough control of knowledge, cognition, and creativity that they'll be able to charge any price for them once they've trained us out of practicing them ourselves.
The idea itself has merit, even if the implementation is questionable. Giving bots a cryptographic identity would allow good bots to meaningfully have skin in the game and crawl with their reputation at stake. It's not a complete solution, but it could be part of one. Though you can likely get the good parts from HTTP request signing alone; Cloudflare's additions to that seem fairly extraneous.
I honestly don't know what a good solution is. The status quo is certainly completely untenable. If we keep going like we are now, there won't be a web left to protect in a few years. It's worth keeping in mind that there's an opportunity cost, and even a bad solution may be preferable to no solution at all.
I think the solution is some sort of PoW gateway like people are setting up now. Or a micropayments system where each page request costs a fraction of a penny.
You could combine that with some sort of IPFS/Bittorrent like system where you allow others to rehost your static content, indexed by the merkle hash of the content. That would allow users to donate bandwidth.
I really don't like the idea that you can get out of this by surveilling user agents more or distinguishing between "good" and "bad" bots, which is a massive social problem.
I think we are just reaping the delayed storm of the insanely inefficient web we have created over the past decades.
There is absolutely no need for the vast majority of websites to use databases and SSR; most of the web could be statically rendered and cost peanuts to host, but alas, WP is the most popular "framework".
What if content providers reduced the 30k word page for a recipe down to just the actual recipe, would this reduce the amount of data these bots are pulling down?
I don't see this slowing down. If websites don't adapt to the AI deep search reality, the bot will just go somewhere else. People don't want to read these massive long form pages geared at outdated Google SEO techniques.
You're painting this as a problem that is somehow related to overly long form text based web pages. It isn't. If you host a local cleaning company site, or a game walkthrough site, or a roleplaying forum, the bots will flood the gates all the same.
You are right that it doesn't look like it is slowing down, but the developing result of this will not be people posting a shorter recipe, it will be a further contraction of the public facing, open internet.
That's only a stopgap measure, eventually they'll realize what's happening and use distributed IPs and fake user agents to look like normal users. The Tencent and Bytedance scrapers are already doing this.
There are some ASN-based DROP list collections on GitHub if that would help.
Oh! That didn't even occur to me. Yeah, I could pump that into ipset. Got one in particular that you think is reliable?
I think Spamhaus runs the big one.
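If it helps, here's a minimal Go sketch of the glue, assuming you've already fetched one of those DROP-style lists to drop.txt and that entries look like "CIDR ; comment" (true of the Spamhaus DROP format, but check whichever list you pick):

```go
// A minimal sketch: turn a downloaded DROP-style list (assumed saved
// as drop.txt, one "CIDR ; comment" entry per line) into an ipset
// restore script on stdout.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("drop.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fmt.Println("create droplist hash:net")

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, ";") {
			continue // skip blanks and comment lines
		}
		// Keep only the CIDR before the "; SBLxxx" style comment.
		cidr := strings.TrimSpace(strings.SplitN(line, ";", 2)[0])
		fmt.Printf("add droplist %s\n", cidr)
	}
}
```

Pipe the output through `ipset restore` and reference the `droplist` set from an iptables/nftables drop rule; the file name, set name, and entry format here are assumptions about whichever list you end up using.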
Couldn't it be addressed in front of the application with a fail2ban rule, some kind of 429 Too Many Requests quota on a per session basis? Or are the crawlers anonymizing themselves / coming from different IP addresses?
Yeah, that's where IP intelligence comes in. They're using pretty big IP pools, so, either you're manually adding individual IPs to a list all day (and updating that list as ASNs get continuously shuffled around), or you've got a process in the background that essentially does whois lookups (and caches them, so you aren't also being abusive), parses the metadata returned, and decides whether that request is "okay" or not.
The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% will require a lot more effort. You start encountering edge cases, like crawlers that use AWS for their crawling activity, but also one of your customers somewhere is syncing their WooCommerce orders to their in-house ERP system via a process that also runs on AWS.
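For what it's worth, a minimal sketch of the shape of that background process; lookupASN is a hypothetical placeholder for whatever whois/RDAP client or IP-to-ASN service you actually wire in, and the blocked org substrings are purely illustrative:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

type asnInfo struct {
	ASN     string
	OrgName string
	Fetched time.Time
}

var (
	mu    sync.Mutex
	cache = map[string]asnInfo{} // keyed by IP to avoid repeat whois lookups
)

// Hypothetical placeholder: swap in a real whois/RDAP/IP-to-ASN lookup.
func lookupASN(ip string) asnInfo {
	return asnInfo{ASN: "AS0", OrgName: "unknown", Fetched: time.Now()}
}

// Illustrative only: org-name fragments you've decided not to serve.
var blockedOrgSubstrings = []string{"alibaba", "bytedance"}

func allowRequest(ip string) bool {
	mu.Lock()
	info, ok := cache[ip]
	mu.Unlock()

	if !ok || time.Since(info.Fetched) > 24*time.Hour {
		info = lookupASN(ip) // only cache misses ever hit whois
		mu.Lock()
		cache[ip] = info
		mu.Unlock()
	}

	org := strings.ToLower(info.OrgName)
	for _, bad := range blockedOrgSubstrings {
		if strings.Contains(org, bad) {
			return false
		}
	}
	return true
}

func main() {
	// 203.0.113.7 is a TEST-NET address, used purely as an example.
	fmt.Println(allowRequest("203.0.113.7"))
}
```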
I've had crawlers get stuck in a loop before on a search page where you basically could just keep adding things, even if there are no results. I filtered the requests that were bots for sure (requests specified long past the point of any results). It was over a million unique IPs, most of which were only doing 1 or 2 requests on their own (from many different IP blocks).
They are spreading themselves across lots of different IP blocks
It's called Anubis.
Anubis blocks all phones with odd processor counts (many Pixel phones, for example).
Isn't that the one that shows anime characters? Or is Anubis the "professional" version that doesn't show anime chars?
Yes, that's Anubis. And yes, you pay to not show the anime catgirl.
That's genius.
Honestly the more Anubis' anime mascot annoys people the more I like it.
The point of this is to make things difficult for bots, not to annoy visitors of the site. I respect that it is the dev's choice to do what they want with the software they create and make available for free. Anime is a polarizing format for reasons beyond the scope of this discussion. It definitely says a lot about the dev.
It says a lot more about the pearl clutching of the people complaining about it than it does the dev.
Anime is only "polarizing" for an extreme subset of people. Most people won't care. No one should care, it's just a cute mascot image.
> meanwhile, each page load can take a full 1s round trip (most of that spent in MySQL)
Can't these responses still be cached by a reverse proxy as long as the user isn't logged in, which the bots presumably aren't?
They're presumably not crawling the same page repeatedly, and caching the pages long enough to persist between crawls would require careful thinking and consultation with clients (e.g. if they want their blog posts to show up quickly, or an "on sale" banner or etc).
It'd probably be easier to come at it from the other side and throw more resources at the DB or clean it up. I can't imagine what's going on that it's spending a full second on DB queries, but I also don't really use WP.
It's been a few years since I last worked with WP, but the performance issue is because they store a ton of the data in a key-value store instead of tables with fixed columns.
This can result in a ton of individual row hits on your database, for what in any normal system is a single 0.1ms (often faster) DB request.
Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper. It's just that WP is in general ** for performance.
If you want to see what a bad scraper does with parallel requests with little limits, yea, WP is going down without putting up any struggle. But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
Is that WP Core or a result of plugins? Only if you know offhand; I don't need to know badly enough to be worth digging in.
> Any web scraper that is scraping SEQUENTIALLY at 1r/s is actually a well-behaved and non-intrusive scraper.
I think there's still room for improvement there, but I get what you mean. I think an "ideal" bot would base its QPS on response time and back off if it goes up, but it's also not unreasonable to say "any website should be able to handle 1 QPS without flopping over".
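For illustration, a minimal sketch of that back-off behaviour, with made-up URLs and thresholds; a real crawler would obviously persist state and parse links, this just shows the latency-driven delay adjustment:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	delay := time.Second // start around 1 request per second
	urls := []string{"https://example.com/a", "https://example.com/b"}

	for _, u := range urls {
		start := time.Now()
		resp, err := http.Get(u)
		elapsed := time.Since(start)
		if err == nil {
			resp.Body.Close()
		}

		switch {
		case err != nil || elapsed > 2*time.Second || (resp != nil && resp.StatusCode == http.StatusTooManyRequests):
			// The origin is struggling (or told us to slow down): back off.
			delay *= 2
			if delay > time.Minute {
				delay = time.Minute
			}
		case elapsed < 200*time.Millisecond && delay > time.Second:
			// The origin looks healthy again: cautiously speed back up.
			delay /= 2
		}

		fmt.Printf("fetched %s in %v, next request in %v\n", u, elapsed, delay)
		time.Sleep(delay)
	}
}
```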
> It's just that WP is in general ** for performance.
WP gets a lot of hate, and much of it is deserved, but I genuinely don't think I could do much better with the constraint of supporting an often non-technical userbase with a plugin system that can do basically arbitrary things with varying qualities of developers.
> But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).
> Is that WP Core or a result of plugins?
A combination of everything... Take into account that it's been 8 years since I last worked in PHP and WordPress, so maybe things have improved, but I doubt it, as some issues are structural.
* PHP is a fire-and-forget programming language, so within a request there is no persistence of data (unless you offload to an external cache server). This results in the PHP code being re-run in full for every request.
* Then we have WP core, which is not exactly shy in its calls to the DB. The way they store data in a key/value system really hurts the performance. Remember what I said above about PHP: a heavy design means the language has to redo all those calls on every request.
* Followed by extensions that are, let's just say, not always optimally written. The plugins are often the main reason you see so many leaked databases on the internet.
The issue with WP is that its design is like 25 years old. It gained most of its popularity because it was free and you were able to extend it with plugins. But it's that same plugin system that made it harder for the WP developers to really tackle the performance issues, as breaking a ton of plugins often results in losing market share.
The main reason WP has survived growing web traffic is that PHP's performance has improved by a factor of roughly 3x over the years, combined with server hardware itself getting faster and faster. It also helped that cache plugins exist for WP.
But now, as you have noticed, when you have a ton of passive or aggressive scrapers hitting WP websites, the cache plugins that have been the main protection layer keeping WP sites functional cannot handle it. Scrapers hit every page, even pages that are unpopular/archived/... and normally never get cached. Because you're getting hit on those unpopular pages, the fundamental weakness of WP shows.
The only way you can even partially deal with this type of behavior (beyond just blocking scrapers) is by increasing your database memory limits by a ton, so you're not constantly swapping; increasing the page caching in your actual WP cache extensions, so more is held in memory; and probably also increasing the number of PHP workers your server can run, more DB capacity, and so on.
But that assumes you have control over your WP hosting environment. The companies that host hundreds of thousands or millions of sites are not exactly motivated to throw tons of money at the problem; they prefer that you "upgrade" to more expensive packages that will only partially mitigate the issue.
In general, everybody is f___ed ... The amount of data scraping is only going to get worse.
Especially now that LLMs have tool usage, as in, they can search the internet for information themselves. This is going to result in tens of millions of requests from LLMs. Somebody searching for, say, a cookie recipe may result in dozens of page hits within a second, where a normal user in the past first did a Google search (hitting Google's cache), only then opened a page, decided it wasn't what they wanted, went back, tried somewhere else. What may have been 10 requests over multiple sites across a 5-10 minute time frame is now going to be dozens of parallel requests per second.
LLMs are great search engines, but as the tech moves to consumer-level hardware, you're only going to see this get worse.
The solution is a fundamental rework of a lot of websites. One of the main reasons I switched away from PHP years ago, and eventually settled on Go, was that even at that time we were already hitting limits. It's one of the reasons Facebook made Hack (PHP with persistence and other optimizations). Re-rendering complete pages on every request is just giving away performance; not being able to cache data in-process... you get the point.
> This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).
The issue is not cached content; it's that they go for all the data in your database. They do not care if your articles are from 1999.
The only way you can solve this issue is by having API endpoints on every website, where scrapers can feed on your database data directly (so you avoid needing to render complete pages), AND where they can poll /api/articles/latest-changed or something like that.
And that assumes this is standardized across the industry, because if it's not, it's just easier for scrapers to go after all pages.
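A minimal sketch of what such an endpoint could look like; the Article shape, the in-memory slice, and the ?since= parameter are illustrative stand-ins for a cheap indexed query on an updated-at column:

```go
package main

import (
	"encoding/json"
	"net/http"
	"time"
)

type Article struct {
	ID        int       `json:"id"`
	URL       string    `json:"url"`
	UpdatedAt time.Time `json:"updated_at"`
}

// In practice this would be a cheap indexed query on an updated-at column.
var articles = []Article{}

func latestChanged(w http.ResponseWriter, r *http.Request) {
	// ?since=<RFC3339> lets a polite scraper ask only for what changed.
	since, err := time.Parse(time.RFC3339, r.URL.Query().Get("since"))

	out := make([]Article, 0, len(articles))
	for _, a := range articles {
		if err != nil || a.UpdatedAt.After(since) {
			out = append(out, a) // missing/invalid ?since= means "everything"
		}
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(out)
}

func main() {
	http.HandleFunc("/api/articles/latest-changed", latestChanged)
	http.ListenAndServe(":8080", nil)
}
```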
FYI: I wrote my own scraper in Go; on a dual-core VPS that costs 3 euros a month, it can do 10,000 scrapes per second (we are talking direct scrapes, not going through a browser to deal with JS detection).
Now, do you want to guess the resource usage on your WP server if I let it run wild? ;) You're probably going to spend 10 to 50x more money just to feed my scraper without me taking your website down.
Now, do I do 10,000 requests per second? No... because 1r/s per website is still 86,400 page hits per day. And because I combined this with actually looking for each website's "latest xxxx" pages and caching that content, I knew that I only needed to scrape X new pages every 24h. So it took me a month or three for some big website scrapes, and later you do not even see me, as I am only doing page updates.
But that takes work! You need to design this for every website, and some websites do not have any good spot you can hook into for a low-resource "is there something new" check.
And I'm not even talking about websites that actively try to make scraping difficult (like constantly changing tags, dynamic HTML blocks on render, JS blocking, forced captchas), which ironically hurts them more, as it can result in full rescrapes of their sites.
So ironically, the easiest solution for less scrupulous scrapers is to simply throw resources at the issue. Why bother with "is there something new" effort on every website when you can just rescrape every page link you find with a dumb scraper, compare it against your local cache checksum, and update your scraped page result? That's how you get those over-aggressive scrapers that DDoS websites. Combine that with half of the internet being WP websites, lol.
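And that dumb checksum loop really is tiny; a minimal sketch (the in-memory map stands in for whatever cache a real scraper would persist, and hashing raw HTML will of course flap on pages with dynamic blocks):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
)

var seen = map[string]string{} // url -> checksum of the last fetched body

// changed reports whether the page body differs from what we saw last time,
// so the expensive parse/store step can be skipped when nothing moved.
func changed(url string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}

	sum := sha256.Sum256(body)
	hash := hex.EncodeToString(sum[:])
	if seen[url] == hash {
		return false, nil
	}
	seen[url] = hash
	return true, nil
}

func main() {
	fmt.Println(changed("https://example.com/"))
}
```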
The amount of resources needed to scrape is so small, and the more you try to prevent scrapers, the more you're going to hinder your own customers / legit users.
And again, this is just me scraping some novel/manga websites for my own private usage / data hoarding. The big boys have access to complete IP blocks, can resort to using residential IPs (as some sites detect whether you're coming from a datacenter-leased IP or a home ISP IP), and have way more resources available to them.
This has been way too long, but the only way to win against scrapers is a standardized way to do legit scraping. Ironically, we used to have this with RSS feeds years ago, but everybody gave up on them. When you have an easier endpoint for scrapers, a lot of them have less incentive to just scrape your every page. Will there be bad guys? Yep, but then it becomes easier to target just them until they also comply.
But the internet will need to change into something new for it to survive the new era... and I think standardized API endpoints will be that change. Or everybody needs to go behind login pages, but yeah, good luck with that, because even those are very easy to bypass with automated account creation.
Yeah, everybody is going to be f___ed, because small websites can forget about making money with advertising. The revenue model is also going to change; we already see this with Reddit selling their data directly to Google.
And this has been way too much text.
> The way they store data in a key/value system really hurts the performance
It doesn't, unless your site has a lot of post/product/whatever entries in the DB and you have users searching among them with multiple criteria at the same time. Only then does it cause many self-joins and create performance concerns. Otherwise the key-value setup is very fast when it comes to just pulling key+value pairs for a given post/content.
Today WordPress can easily do 50 req/sec cached (locally) on $5/month hosting with PHP 8+. It can easily do 10 req/sec uncached for logged-in users, with absolutely no form of caching (though you would generally use an object cache, pushing it much higher).
The White House is on WordPress. NASA is on WordPress. TechCrunch, CNN, Reuters, and a lot more.
Just want to point out that your 50 req/sec cached means nothing when dealing with scrapers, which is the entire topic...
The issue is that scrapers hit so many pages that you can never cache everything.
If your website is a 5-page blog with no built-up archive of past posts, sure... scrapers are not going to hurt, because they keep hitting the cached pages and resetting the invalidation.
But for everybody else, getting hit on uncached pages results in heavy DB load and kills your performance.
Scrapers do not care about your top (cached) pages, especially aggressive ones that just rescrape non-stop.
> It doesn't, unless your site has a lot of post/product/whatever entries in the DB
Exactly what is being hit by scrapers...
> The White House is on WordPress. NASA is on WordPress. TechCrunch, CNN, Reuters, and a lot more.
Again, not the point. They can throw resources at the problem and cache tons of data with 512GB/1TB WordPress/DB servers, which effectively turns WP into a mostly static site.
It's everybody else that feels the burn (see the article, the previous poster, and others).
Do you understand the issue now? WP is not equipped to deal with this type of traffic, as it's not normal human traffic. WP is not designed to handle this; it barely handles normal traffic without a lot of resources thrown at it.
There is a reason the Reddit/Slashdot effect exists. Just a few thousand people going to a blog tend to make a lot of WP websites unresponsive, and that is with the ability to cache those pages!
Now imagine somebody like me letting a scraper loose on your WP website. I can scrape 10,000 pages/sec on a 4-buck VPS, and each page I hit that is not in your cache will make your DB scream even more, because of how WP works. So what are you going to do with your 50 req/s cached when my next 9,950 req/s hit all your non-cached pages?! You get the point?
And FYI: 10,000 r/s on your cached pages will also make your WP install unresponsive. Scraper resource usage vs WP is a fight nobody wins.
That would be nice! This doesn't work reliably enough for WP sites. Whether it's devs making changes and testing them in prod, or dynamic content loaded in identical URLs, my past attempts to cache html have caused questions and complaints. The current caching strategy hits a nice balance and hasn't bothered anyone, with the significant downside that it's vulnerable to bot traffic.
(If you choose to read this as, "WordPress is awful, don't use WordPress", I won't argue with you.)
This is probably a dumb question, but at what point do we put a simple CAPTCHA in front of every new user that arrives at a site, then give them a cookie and start tracking requests per second from that user?
I guess it's a kind of soft login required for every session?
update: you could bake it into the cookie approval dialog (joke!)
The post-AI web is already a huge mess. I'd prefer solutions that don't make it worse.
I myself browse with cookies off, sort of, most of the time, and the number of times per day that I have to click a Cloudflare checkbox or help Google classify objects from its datasets is nuts.
> The post-AI web is already a huge mess.
You mean the peri-AI web? Or is AI already done and over and no longer exerting an influence?
My cousin manages a dozen mid-sized informational websites and communities; his former hosting provider kicked him out because he refused to pay the insane bills that resulted from AI bots literally DDoSing his sites...
He unfortunately had no choice but to put most of the content behind a login wall (you can only see parts of the articles/forum posts when logged out), and at this point he is strongly considering just hard-paywalling some content... We're talking about someone who in good faith provided partial data dumps of freely available content for these companies to download. But caching / ETags? None of these AI companies, hiring "the best and the brightest", have ever heard of that. Rate limiting? LOL, what is that?
This is nuts, these AI companies are ruining the web.
I'm not sure why they don't just cache the websites and avoid going back for at least 24 hours, especially in the case of most sites. I swear it's like we're re-learning software engineering basics with LLMs / AI and it kills me.
Yeah, the landscape when there were many more search engines must have been exactly the same...
I think the eng teams behind those were just more competent / more frugal on their processing.
And since there wasn't any AWS equivalent, they had to be better citizens, since banning their well-known IP ranges was trivial for the crawled websites.
It's worth noting that search engines back then (and now? except the AI ones) generally tended to follow robots.txt, which meant that if there were heavy areas of your site that you didn't want them to index you could filter them out and let them just follow static pages. You could block off all of /cgi-bin/ for example and then they would never be hitting your CGI scripts - useful if your guestbook software wrote out static files to be served, for example.
The search engines were also limited in resources, so they were judicious about what they fetched, when, and how often; optimizing their own crawlers saved them money, and in return it also saved the websites too. Even with a hundred crawlers actively indexing your site, they weren't going to index it more than, say, once a day, and 100 requests in a day isn't really that much even back then.
Now, companies are pumping billions of dollars into AI; budgets are infinite, limits are bypassed, and norms are ignored. If the company thinks it can benefit from indexing your site 30 times a minute then it will, but even if it doesn't benefit from it there's no reason for them to stop it from doing so because it doesn't cost them anything. They cannot risk being anything other than up-to-date, because if users are coming to you asking about current events and why space force is moving to Alabama and your AI doesn't know but someone else's does, then you're behind the times.
So in the interests of maximizing short-term profit above all else - which is the only thing AI companies are doing in any way shape or form - they may as well scrape every URL on your site once per second, because it doesn't cost them anything and they don't care if you go bankrupt and shut down.
Bandwidth cost more then, so the early search engines had an incentive not to massively increase their own costs, if nothing else.
The blekko search engine index was only 1 billion pages, compared to Common Crawl Foundation's crawl of 3 billion webpages per month.
This! Today I asked Claude Sonnet to read the Wikipedia article on “inference” and answer a few of my questions.
Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Still no cogent answer. Pathetic. Very much an Anthropic blindspot—to the point of being at least amoral and even immoral.
Do the big AI corporations that have profited greatly from the Wikimedia Foundation give anything back? Or are they just large internet bloodsuckers without ethics?
Dario and Sam et al.: Contribute to the welfare of your own blood donors.
> Sonnet responded: “Sorry, I have no access.” Then I asked it why and it was flummoxed and confused. I asked why Anthropic did not simply maintain mirrors of Wikipedia in XX different languages and run a cron job every week.
Even worse when you consider that you can download all of Wikipedia for offline use...
> Then I asked it why
I'm still learning the landscape of LLMs, but do we expect an LLM to be able to answer that? I didn't think they had meta information about their own operation.
your understanding is correct.
you can even torrent all of wikipedia, and a whole bunch of other wikis.
Would be great if they did that and maybe seeded it too.
Once the crawler goes up, who cares what it brings down?
That's not my department! says Crawler von Braun
That's gold, I've just stumbled on the original a week ago
It's because they don't give a shit whether the product works properly or not. By blocking AI scraping, sites are forcing AI companies to scrape faster before they're blocked. And faster means sloppier.
There's also the point that if the web site is down after you scraped it, that's one more site's data you've scraped that your competition now can't.
I guess they prefer paying for bandwidth rather than storage
The people at the forefront of creating the shortcut machine are taking shortcuts. We're on a slow march towards the death of attention to detail.
Slow march? It feels like we've been on that train a while, honestly. It's embarrassing. We don't even have fully native GUIs; they're all browser wrappers.
Who says they don't?
imo when it kills somebody it justifies extreme means such as feeding them with fabricated truths such as LLM generated and artificially corrupted text /s
I'll add my voice to others here that this is a huge problem especially for small hobbyist websites.
I help administer a somewhat popular railroading forum. We've had some of these AI crawlers hammering the site to the point that it became unusable to actual human beings. You design your architecture around certain assumptions, and one of those was definitely not "traffic quintuples."
We've ended up blocking lots of them, but it's a neverending game of whack-a-mole.
> one of those was definitely not "traffic quintuples."
Oh, it was... People warned about the mass adoption of WordPress because of its performance issues.
Internet usage kept growing, even without LLM scraping en masse. Everybody wants more and more up-to-date info, recent price checks, and so many other features. This trend has been going on for 10+ years.
It's just that now, bot scraping for LLMs has pushed some sites over the edge.
> We've ended up blocking lots of them, but it's a neverending game of whack-a-mole.
And unless you block every IP, you cannot stop them. It's really easy to hide scrapers, especially if you use a slow scrape rate.
The issue comes when you have, like one of the posters here, a setup where a DB call takes up to 1s for some product pages that are not in cache. Those sites were already living on borrowed time.
Ironically, better software on their site (like not using WP) would let them easily handle 1000x the volume with the same resources. And don't get me started on how badly configured a lot of sites are on the backend.
People are kind of blaming the wrong issue. Our need for up-to-date data has been growing for over 10 years. It's just that people considered a website that took 400ms to generate a page to be OK (when in reality it's wasting tons of resources or is limited in the backend).
This is something I have a hard time understanding. What is the point of this aggressive crawling? Gathering training data? Don't we already have massive repos of scraped web data being used for search indexing? Is this a coordination issue, each little AI startup having to scrape its own data because nobody is willing to share their stuff as regular dumps? For Wikipedia we have the official offline downloads, for books we have books3, but there's not an equivalent for the rest of the web? Could this be solved by some system where website operators submit text copies of their sites to a big database? Then in robots.txt or similar add a line that points to that database with a deep link to their site's mirrored content?
The obvious issues are: a) who would pay to host that database. b) Sites not participating because they don't want their content accessible by LLMs for training (so scraping will still provide an advantage over using the database). c) The people implementing these scrapers are unscrupulous and just won't bother respecting sites that direct them to an existing dumped version of their content. d) Strong opponents to AI will try poisoning the database with fake submissions...
Or does this proposed database basically already exist between Cloudflare and the Internet Archive, and we already know that the scrapers are some combination of dumb and belligerent and refuse to use anything but the live site?
I asked Google AI Mode “does Google ai mode make tens of site requests for a single prompt” and it showed “Looking at 69 sites” before giving a response about query fan-out.
Cloudflare has some large part of the web cached, IA takes too long to respond and couldn’t handle the load. Google/OpenAI and co could cache these pages but apparently don’t do it aggressively enough or at all
I don't think you're correct about Google. Caching webpages is bread-and-butter for search engines, that's how they show snippets.
They might cache it, but what if it changed in the last 30 seconds and now their information is out of date? Better make another request just in case.
That's not how search engines work. They have a good idea of which pages might be frequently updated. That's how "news search" works, and even small startup search engines like blekko had news search.
Indeed. My understanding is that crawl is a real expense at scale so they optimize for "just enough" to catch most site update rhythms and then use other signals (like blog pings, or someone searching for a URL that's not yet crawled, etc) to selectively chase fresher content.
My experience is that a news crawl is not a big expense at scale, but so far I've only built one and inherited one. BTW No one uses blog pings, the latest hotness is IndexNow.
I suspect they simply do not care. The owners of these companies are exactly the sort of people who are genuinely puzzled and offended when someone wants them to think about anything but themselves.
The attitude is visible in everything around AI, why would crawling be different?
I was sysadmining a virtual art gallery with thousands of "exhibits", including sound, video, and images.
We had never had any issues before, and suddenly we got taken down 3 times in as many days. When I investigated, it was all Claude.
They were just pounding every route regardless of timeouts with no throttle. It was nasty.
They give web scrapers a bad rep.
Web scrapers earned their bad rep all on their own, thank you very much. This is nothing new. Scrapers have no concern for whether a site is mostly static with stale text or constantly updated. Most sites are not FB/Twitter, er, X/etc. Even retail sites other than Amazon don't have new products being listed every minute. But accounting for that would require someone on the scraper's side to pay attention; instead they just let the computer run, even if it is reading the same data every time.
Even if sites offered their content in a single downloadable file for bots, the bot creators would not trust that it isn't stale and out of date, so they'd continue to scrape anyway, ignoring the easy method.
I created and maintain ProtonDB, a popular Linux gaming resource. I don't do ads, just pay the bills from some Patreon donations.
It's a statically generated React site I deploy on Netlify. About ten days ago I started incurring 30GB of data per day from user agents indicating they're using Prerender. At this pace that alone will push me past the 1TB allotted for my plan, so I'm looking at an extra ~$500 USD a month in bandwidth boosters.
I'm gonna try the robots.txt options, but I'm doubtful this will be effective in the long run. Many other options aren't available if I want to continue using a SaaS like Netlify.
My initial thoughts are to either move to Cloudflare Pages/Workers where bandwidth is unlimited, or make an edge function that parses the user agent and hope it's effective enough. That'd be about $60 in edge function invocations.
I've got so many better things to do than play whack-a-mole on user agents and, when failing, pay this scraping ransom.
Can I just say fuck all y'all AI harvesters? This is a popular free service that helps get people off of their Microsoft dependency and live their lives on a libre operating system. You wanna leech on that? Fine, download the data dumps I already offer on an ODbL license instead of making me wonder why I fucking bother.
Proton DB is an amazing website that I use all the time. Thank you for maintaining it!
Thanks. Appreciate your support, and very glad it brings you value.
$500 for exceeding 1TB? The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan. Pick your favourite $5/month VPS platform - I suggest Hetzner with its 20TB limit (if their KYC process lets you in) or Digital Ocean if not (with only 1TB but overage is only a few bucks extra). Even freaking AWS, known for extremely high prices, is cheaper than that (but still too expensive so don't use it).
> The problem here isn't the crawlers, it's your price-gouging, extortionate hosting plan.
No, it's both.
The crawlers are lazy, apparently have no caching, and there is no immediately obvious way to instruct/force those crawlers to grab pages in a bandwidth-efficient manner. That being said, I would not be surprised if someone here will smugly contradict me with instructions on how to do just that.
In the near term, if I were hosting such a site I'd be looking into slimming down every byte I could manage, using fingerprinting to serve slim pages to the bots and exploring alternative hosting/CDN options.
> The problem here isn't the crawlers,
One of the worst takes I've seen. Yes, that's expensive, but the individuals doing insane amounts of unnecessary scraping are the problem. Let's not act like this isn't the case.
To clarify the math: Netlify bills $50 for each 100GB over the Pro plan limit of 1TB, which is the barrel I'm looking down just this month, before others get the same idea. So yes, I'm squeezed on both sides unless I put in the work to rehost.
I went to a Subway shop that charged $50 per lettuce strip past the first 20. As the worker sprinkled lettuce on my sandwich, I counted anxiously, biting my nails. 19, phew, I'm safe. I think I'll come back here tomorrow.
Tomorrow, someone in front of me asked for extra lettuce. The worker got confused and put it on my sandwich. I was charged $1000. Drat.
> The worker got confused and put it on my sandwich.
No, this is where you're completely and totally incorrect. There is no 'worker accidentally making a human mistake that costs you money' here. This is a 'multi-billion dollar company routinely runs scripts that they KNOW cost you money, but do it anyways because it generates profit for them'. To fix your example,
You RUN a Subway that sells sandwiches. Your lettuce provider charges you $1 per piece of lettuce. Your average customer is given $1 worth of lettuce in their sub. One customer keeps coming in, reaching over the counter, and grabbing handfuls of lettuce. You cannot ban this customer because they routinely put on disguises and ignore your signs saying 'NO EXTRA LETTUCE'. Eventually this bankrupts you, forces you to stop serving lettuce in your subs entirely, or you have to put up bars (eg, Cloudflare) over your lettuce bins.
I'm not sure what Netlify is doing, but the heaviest assets on your website are your javascript sources. Have you considered hosting those on GitHub pages, which has a generous free tier?
The images are from steamcdn-a.akamaihd.net, which I assume is already being hosted by a third-party (Steam)
I'd rather not involve Microsoft but I recognize there are other options. It is additional work/complexity I'll probably have to take on.
Do you have the ability to block ASNs? I help sysadmin a DIY building forum, and we cut 80% of the load from our server by blocking all Alibaba IPs in ASN 45102. Singapore was sending the most bot traffic.
Thank you for making ProtonDB! I use it a ton <3
Please use a default deny on the user agent. It can block a lot of accessibility tools and makes privacy difficult.
Did you mean to say don't use a default deny?
Yes
Go for Cloudflare pages.
Your mistake is openly suggesting on HN that you're going to use Cloudflare, increasing the centralization of the internet and contributing to their attestation schemes, while society forces you to be a victim of the tragedy of the commons.
Please believe me that it is not a step I want to take.
Another option that wouldn't contribute to more centralization might be neocities. They give you 3 TB for $5/month. That seems to be _the_ limit though. The dude runs his own CDN just for neocities, so it's not just reselling cloudflare or something.
P.S. Thank you for ProtonDB, it has been so incredibly helpful for getting some older games running.
You don't need to apologize - HN needs to get their heads out of the sand that not everything is a tragedy of the commons, there's a reason why centralization exists, and the decentralized internet as it is now comes with serious drawbacks. We're never going to overcome the popularity of big tech if we can't be honest with the problems they solve.
Also, sue me, the cathedral has defeated the bazaar. This was predictable, as the bazaar is a bunch of stonecutters competing with each other to sell the best stone for building the cathedral with. We reinvented the farmer's market, and thought that if all the farmers united, they could take down Walmart. It's never happening.
In this context, the farmers are trying to deal with rampant abuse that is inconceivable to handle on an individual level.
It's not clear to me what taking down Cloudflare/Walmart means in this context. Nor how banding together wouldn't just incur the very centralization that is presumably so bad it must be taken down.
> Cloud services company Fastly agrees. It reports that 80% of all AI bot traffic comes from AI data fetcher bots.
No kidding. An increasing number of sites are putting up CAPTCHA's.
Problem? CAPTCHAS are annoying, they're a 50 times a day eye exam, and
> Google's reCAPTCHA is not only useless, it's also basically spyware [0]
> reCAPTCHA v3's checkbox test doesn't stop bots and tracks user data
[0] https://www.techspot.com/news/106717-google-recaptcha-not-on...
Webmasters are really kinda stuck between a rock and a hard place with this one.
At least with what I'm doing, poorly configured or outright malicious bots consume about 5000x the resources of human visitors, so having no bot mitigation means I've basically given up and decided I should try to make it as a vegetable farmer instead of doing stuff online.
Bot mitigation in practice is a tradeoff between what's enough of an obstacle to keep most of the bots out, while at the same time not annoying the users so much they leave.
I think right now Anubis is one of the less bad options. Some users are annoyed by it (and it is annoying), but it's less annoying than clicking fire hydrants 35 times and as long as you configure right it seems to keep most of the bots out, or at least drives them to behave in a more identifiable manner.
Probably won't last forever, but I don't know what would, besides going full ancap and doing crypto microtransactions for each page request. That would unfortunately drive off not only the bots, but the human visitors as well.
Anubis is extremely slow on low-end devices, it often takes >30 seconds to complete. Users deserve better, but I guess it's still a better experience than reCaptcha or Cloudflare.
Well, >30 seconds to complete anubis is still better than >30 seconds to complete every single page load because AI bots are overloading the servers.
I've just started clicking away from pages that are full of CAPTCHAs. Ironically this has resulted in me using AI more.
The ironic part: LLMs are very good at solving CAPTCHAs. So the only people bothered by those same CAPTCHAs are the actual site visitors.
What sites need to do is temporarily block repeat requests from the same IPs. Sure, some agents use tens of thousands of IPs, but if they are really as aggressive as people state, you're going to run into the same IPs far more often than you do with normal users.
That will kick out the over-aggressive guys. I have done web scraping and limited it to around 1r/s; you never run into any blocking or detection that way, because you hardly show up. But then you have some *** that sends 1000s of parallel requests down a website, because they never figured out query builders for large page hits and don't know how to build checks against last-updated pages.
One of the main issues I see is that some people simply write the most basic of basic scrapers: see link, follow, spawn process, scrape, see 100 more links... Updates? Just rescrape the website, repeat, repeat... It takes time to make a scrape template for each website that knows where to check for updates, so some never bother.
I often use a VPN or iCloud private relay. Some sites gripe “too many accesses (downloads) from your IP address today.”
The devil’s in the details. I (a non-bot) sometimes resort to VPN-flipping.
I suppose that some bots try this, just a wild guess.
And because companies like Fastly only measure things via javascript execution and assume everything that doesn't execute JS correctly is a bot, that 80% contains a whole bunch of human persons.
The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:
> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.
...
> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.
And also there's one final number that isn't in the Fastly report but is in the El Reg article[2]:
> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.
1: https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat...
2: https://www.theregister.com/2025/08/21/ai_crawler_traffic/
It's really bad for anyone using anything other than Chrome to browse the web, or any accessibility tools or privacy software, because a bunch of sites will now block you, assuming you're a web crawler.
Blogspam just linking to a bunch of prior reports and posts from earlier in the year.
Some ongoing recent discussion:
Cloudflare Radar: AI Insights
https://news.ycombinator.com/item?id=45093090
The age of agents: cryptographically recognizing agent traffic
https://news.ycombinator.com/item?id=45055452
That Perplexity one:
Perplexity is using stealth, undeclared crawlers to evade no-crawl directives
https://news.ycombinator.com/item?id=44785636
AI crawlers, fetchers are blowing up websites; Meta, OpenAI are worst offenders
https://news.ycombinator.com/item?id=44971487
This has been widely reported for months now. Anthropic just reported another $13B in funding. Clearly, the companies just do not care to invest any effort to improving their behavior.
Can I ask a stupid question? Why is this so much worse than what they were doing to gather articles for traditional search engines? I assume that they are gathering pretty much the same data? It is the same articles, no?
I just realized these are callouts from the LLM on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.
No, the traffic is not caused by client requests (like when your ChatGPT session does a search and checks some sources). It's caused by training runs. The difference is that AI companies are not storing the data they scrape. They let the model ingest the data, then throw it away. When they train the next model, they scrape the entire Internet again. At least that's how I understand it.
People who didn't respect basic ethics, legal copyrights, and common sense aren't gonna stop because they're a nuisance. They'll keep at it until they've ruined what birthed them so they may replace it. Fuck AI.
Why don't sites just start publishing a dump of their site that crawlers could pull instead? I realize that won't work for dynamic content, but surely a lot of these "small" sites that are out there which are currently getting hammered, are not purely dynamic content?
Maybe we could just publish a dump, in a standard format (WARC?), at a well-known address, and have the crawlers check there? The content could be regularly updated, and use an etag/etc so that crawlers know when its been updated.
I suspect that even some dynamic sites could essentially snapshot themselves periodically, maybe once every few hours, and put it up for download to satiate these crawlers while keeping the bulk of the serving capacity for actual humans.
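For the serving side, a minimal sketch, assuming the snapshot is regenerated out of band into a file (site-dump.warc.gz here is a made-up name) and exposed at an agreed path; the /.well-known/site-dump path is an assumption, not a registered well-known URI. Go's http.ServeFile already honours If-Modified-Since based on the file's mtime, so a polite crawler only re-downloads when the dump changes:

```go
package main

import "net/http"

func main() {
	http.HandleFunc("/.well-known/site-dump", func(w http.ResponseWriter, r *http.Request) {
		// ServeFile sets Last-Modified from the file's mtime and answers
		// If-Modified-Since with 304, so unchanged dumps cost almost nothing.
		http.ServeFile(w, r, "site-dump.warc.gz")
	})
	http.ListenAndServe(":8080", nil)
}
```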
Because crawlers aren't concerned about the bandwidth of the sites they crawl and will simply continue to take everything, everywhere, all the time regardless of what sites do.
Also it's unfair to expect every small site to put in the time and effort to, in essence, pay the Danegeld to AI companies just for the privilege of their continued existence. It shouldn't be the case that the web only exists to feed AI, or that everyone must design their sites around feeding AI.
"It used to be when search indexing crawler, Googlebot, came calling, I could always hope that some story on my site would land on the magical first page of someone's search results so they'd visit me, they'd read the story, and two or three times out of a hundred visits, they'd click on an ad, and I'd get a few pennies of income."
Perhaps the AI crawlers can "click on some ads"
Can we get an AI-powered tarpit for these crawlers?
Add a hidden link, put it in robots.txt
A crawler hits that link, a light-on-resources language model produces infinite amounts of plausible-looking gibberish for them to crawl with links and everything.
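A minimal sketch of that shape of tarpit, except with cheap pseudo-random filler where the small language model would go; the /trap/ path is illustrative and should be disallowed in robots.txt so compliant crawlers never see it:

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

var words = []string{"cathedral", "bazaar", "stone", "market", "crawler", "archive", "ledger"}

func tarpit(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, "<html><body><p>")
	for i := 0; i < 200; i++ {
		fmt.Fprintf(w, "%s ", words[rand.Intn(len(words))])
	}
	fmt.Fprint(w, "</p>")
	// Every generated page advertises more generated pages.
	for i := 0; i < 5; i++ {
		fmt.Fprintf(w, `<a href="/trap/%d">more</a> `, rand.Int())
	}
	fmt.Fprint(w, "</body></html>")
}

func main() {
	// Disallow /trap/ in robots.txt and link to it invisibly, so only
	// crawlers that ignore the rules ever end up in here.
	http.HandleFunc("/trap/", tarpit)
	http.ListenAndServe(":8080", nil)
}
```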
My site isn't cool enough to get slammed by crawlers. Except for one Chinese bot that just will not give up. Petal bot? Something like that.
I have Cloudflare's anti-bot thing turned on and OpenAI and Anthropic appear to either respect my rule or be stopped by it.
Is this data being collected for training sets? That seems problematic. I can't be the only one who's noticed that the web is quickly filling up with AI generated clickbait (which has made using a search engine more difficult).
Today most is probably live crawling during AI query resolution.
no, you get tons of infinite link web following
I just block them by User Agent string[1]. The rest that fake the UA get clobbered by rate limiting[2] on the web server. Not perfect, but our site is not getting hammered any more.
[1] https://perishablepress.com/ultimate-ai-block-list/
[2] https://github.com/jzdziarski/mod_evasive
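If you'd rather do the user-agent filtering in the application instead of the web server, here's the same idea as Go middleware; the UA substrings are a few commonly published AI crawler tokens, illustrative rather than exhaustive:

```go
package main

import (
	"net/http"
	"strings"
)

// Illustrative, not exhaustive: a few published AI crawler UA tokens.
var blockedUA = []string{"GPTBot", "ClaudeBot", "CCBot", "Bytespider", "meta-externalagent"}

func blockAIBots(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := r.UserAgent()
		for _, token := range blockedUA {
			if strings.Contains(ua, token) {
				http.Error(w, "Forbidden", http.StatusForbidden)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", blockAIBots(mux))
}
```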
There is a very large scale crawler that uses random valid user agents and a staggeringly large pool of ips. I first noticed it because a lot of traffic was coming from Brazil and "HostRoyale" (asn 203020). They send only a few requests a day from each ip so rate limiting is not useful.
I run a honeypot that generates urls with the source IP so I am pretty confident it is all one bot, in the past 48 hours I have had over 200,000 ips hit the honeypot.
I am pretty sure this is Bytedance, they occasionally hit these tagged honeypot urls with their normal user agent and their usual .sg datacenter.
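For anyone wanting to replicate that, a minimal sketch of IP-tagged honeypot URLs; the HMAC key and the /hp/ path are placeholders:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
)

var key = []byte("replace-me") // placeholder: any server-side secret

// trapURL derives a per-visitor trap link: it embeds the requesting IP plus
// an HMAC tag so the IP can't be forged when the link is fetched later.
func trapURL(ip string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(ip))
	tag := hex.EncodeToString(mac.Sum(nil))[:16]
	return fmt.Sprintf("/hp/%s-%s", hex.EncodeToString([]byte(ip)), tag)
}

func handler(w http.ResponseWriter, r *http.Request) {
	ip, _, _ := net.SplitHostPort(r.RemoteAddr)
	// The link is invisible to humans; log any hit on /hp/ and compare the
	// embedded IP with the IP that fetched it to correlate crawler pools.
	fmt.Fprintf(w, `<html><body><a href="%s" style="display:none">.</a></body></html>`, trapURL(ip))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```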
My site has also recently been getting massively hit by Brazilian IPs. It lasts for a day or two, even if they are being blocked.
I wonder if you could implement a dummy rate limit? Half the time you are rate limited randomly. A real user will think nothing of it and refresh the page.
That will irritate real users half the time while the bots won't care.
If they are a real user going on your site in 2025 then they have no alternative they are even interested in. They will blame their ISP and wait.
Meanwhile rate limiting the llm could potentially cost a lot of money in time and compute to people who don’t have our best interests at heart. Seems like a win to me.
I've written my own bots that do exactly this. My reason was mainly to avoid detection so as part of that I also severely throttled my requests and hit the target at random intervals. In other words, I wasn't trying to abuse them. I just didn't want them to notice me.
TLDR it's trivial to send fake info when you're the one who controls the info.
The world today is just crooks all the way down (or up, depending on how you look at it).
At the same time, there's a lot of HN pushback against new solutions like Signed Agents by CF
https://news.ycombinator.com/item?id=45066258
Because it's a bad solution. The core problem is that the internet is vulnerable to DDoS attacks and the web has no native sybil resistance mechanism.
Cloudflare's solution to every problem is to allow them to control more of the internet. What happens when they have enough control to do whatever they want? They could charge any price they want.
I'm more afraid of the orgs that are gaining enough control of knowledge, cognition, and creativity that they'll be able to charge any price for them once they've trained us out of practicing them ourselves.
There Is No Moat
Until crawling without being on people's whitelist becomes sufficiently difficult
The idea itself has merit, even if the implementation is questionable.
Giving bots a cryptographic identity would allow good bots to meaningfully have skin in the game and crawl with their reputation at stake. It's not a complete solution, but could be part of one. Though you can likely get the good parts from HTTP request signing alone, Cloudflare's additions to that seem fairly extraneous.
I honestly don't know what a good solution is. The status quo is certainly completely untenable. If we keep going like we are now, there won't be a web left to protect in a few years. It's worth keeping in mind that there's an opportunity cost, and even a bad solution may be preferable to no solution at all.
... I say this while operating an independent web crawler.
I think the solution is some sort of PoW gateway like people are setting up now. Or a micropayments system where each page request costs a fraction of a penny.
You could combine that with some sort of IPFS/Bittorrent like system where you allow others to rehost your static content, indexed by the merkle hash of the content. That would allow users to donate bandwidth.
I really don't like the idea that you can get out of this by surveilling user agents more, or by distinguishing between "good" and "bad" bots, which is a massive social problem.
Nobody wants proof of work; leave the blockchain-inspired nonsense to the crypto children.
This is the original use-case for proof of work though. The idea of using proof of work to hinder spam dates back to 1997[1]
[1] https://en.wikipedia.org/wiki/Hashcash
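For reference, a minimal sketch of the Hashcash idea, independent of any particular product: the client burns CPU finding a nonce whose hash has enough leading zero bits, while verification costs the server a single hash:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

const difficulty = 20 // leading zero bits required of the hash

func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve is what the visitor (or bot) has to do: brute-force a nonce.
func solve(challenge string) uint64 {
	for nonce := uint64(0); ; nonce++ {
		sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
		if leadingZeroBits(sum) >= difficulty {
			return nonce
		}
	}
}

// verify is what the server does: a single hash per submission.
func verify(challenge string, nonce uint64) bool {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
	return leadingZeroBits(sum) >= difficulty
}

func main() {
	nonce := solve("per-session-challenge")
	fmt.Println("nonce:", nonce, "valid:", verify("per-session-challenge", nonce))
}
```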
Hope we’re including this in the energy usage for LLMs.
I think we are just reaping the delayed storm of the insanely inefficient web we have created over the past decades.
There is absolutely no need for the vast majority of websites to use databases and SSR; most of the web could be statically rendered and cost peanuts to host, but alas, WP is the most popular "framework".
From my POV (SaaS, 1k domains): the most destructive/DDoS-like/idiotic brute-force crawling peaked around half a year ago.
CAPTCHAs have eliminated 25 years of browsing-speed progress.
And so the mantle has been passed from the Javascript developer, to the Turing test author.
“I drink your milkshake” type sh
What if content providers reduced the 30k-word page for a recipe down to just the actual recipe? Would this reduce the amount of data these bots are pulling down?
I don't see this slowing down. If websites don't adapt to the AI deep search reality, the bot will just go somewhere else. People don't want to read these massive long form pages geared at outdated Google SEO techniques.
You're painting this as a problem that is somehow related to overly long form text based web pages. It isn't. If you host a local cleaning company site, or a game walkthrough site, or a roleplaying forum, the bots will flood the gates all the same.
You are right that it doesn't look like it is slowing down, but the developing result of this will not be people posting a shorter recipe, it will be a further contraction of the public facing, open internet.
Kinda funny to see someone casually mention a roleplaying forum. That's what I run and it got 10x traffic overnight from AI bots.
Made it when I was a teenager and got stuck running it the rest of my life.
Of course, the bots go super deep into the site and bust your cache.
You'd be within your rights to serve the bots fake gibberish data.
Maybe they'll crawl less when it starts damaging models.
That's only a stopgap measure, eventually they'll realize what's happening and use distributed IPs and fake user agents to look like normal users. The Tencent and Bytedance scrapers are already doing this.
Text content at the sub-page level is approximately 0% of web traffic. It's a non-issue.
Then remove the 30MB of advertisements, serve just the 3kB of text, and your server load will be completely fine.