Does anyone know what the deal is with these scrapers, or why they're attributed to AI?
I would assume any halfway competent LLM driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.
Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection in the last few years, but for years it was very common for residential-proxy-based scrapers to hammer sites, so I'm wondering what's different.
I would love to understand this.
Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.
"It's for AI" feels like lazy reasoning for me... but what IS it for?
One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?
For whatever reason, legislation is lax right now if you claim the purpose of scraping is AI training, even for copyrighted material.
Maybe everyone is trying to take advantage of the situation before the law eventually catches up.
> why they're attributed to AI?
I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.
I stopped trying to understand. Encountering a 404 on my site leads directly to a 1 year ban.
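For anyone curious what a rule like that could look like, here is a minimal sketch, not the commenter's actual setup. It assumes an in-memory table keyed by client IP; the `NotFoundBanList` class and its method names are hypothetical, and a real deployment would more likely push the ban into a firewall or reverse proxy.

```python
import time

# One year in seconds, matching the ban length mentioned above.
BAN_SECONDS = 365 * 24 * 60 * 60


class NotFoundBanList:
    """Hypothetical in-memory ban list: a single 404 bans the client IP."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.time):
        self._ban_seconds = ban_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._banned_until = {}      # ip -> expiry timestamp

    def is_banned(self, ip):
        """Return True while the ban for this IP has not expired."""
        expiry = self._banned_until.get(ip)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Ban has run out, so forget the entry.
            del self._banned_until[ip]
            return False
        return True

    def record_response(self, ip, status):
        """Call after each response; a 404 starts (or restarts) the ban."""
        if status == 404:
            self._banned_until[ip] = self._clock() + self._ban_seconds
```

The request handler would check `is_banned(ip)` before doing any work and call `record_response(ip, status)` afterwards. Note that a rule this strict also bans humans who follow a dead link, which is a trade-off the sketch does not try to soften.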
I’m guessing, but I think a big portion of AI requests now come from agents pulling data specifically to answer a user’s question. I don’t think that data is collected mainly for training now; it is mostly retrieved and fed into LLMs so they can generate a response. Hence the many repeated requests.
There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their stuff. LLMs have no technical improvements, so all their makers can do is throw more and more stolen data into them and hope they somehow cross a nebulous "threshold" where they suddenly become actually profitable to use and sell.
It's a race to the bottom. What's different is we're much closer to the bottom now.
> If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.
Right, this is exactly what they are.
They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or think they shouldn't have to spend time) to learn any kind of specifics of any given site and c) don't care what damage they do to anyone else as long as they get the data they crave.
(a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them.
Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives.
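For contrast, honoring robots.txt takes very little code, which is part of why ignoring it looks deliberate. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the example.com URLs are all made up for illustration, and the file is parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent string for the crawler.
USER_AGENT = "example-bot"


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, url):
    """Return True if the rules allow USER_AGENT to request this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)
# Seconds to wait between requests, or None if the site sets no delay.
delay = parser.crawl_delay(USER_AGENT)
```

A polite crawler would call `may_fetch` before every request and sleep for `delay` seconds between requests to the same host.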