AI companies, and notably AI scrapers, are a cancer that is destroying what's left of the WWW.
I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.
- About 400,000 different IP addresses over about 3 hours
- Mostly residential IP addresses
- Valid and unique user agents and referrers
- Each IP address would make only a few requests with a long delay in between requests
It would hit the server hard until the server became slow to respond, then it would back off for about 30 seconds, then hit hard again. I was able to block most of the requests with a combination of user agent and referrer patterns, though some legit users may be blocked.
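For anyone wondering what blocking on "a combination of user agent and referrer patterns" can look like, here is a minimal sketch. The patterns are made up for illustration (the post does not say which ones were actually used), and `should_block` is a hypothetical helper, not part of any particular server or framework.

```python
import re

# Hypothetical patterns, for illustration only.
# Example idea: user agents claiming a long-outdated Chrome major version (1-89).
BLOCKED_UA_PATTERNS = [
    re.compile(r"Chrome/(?:[1-9]|[1-8]\d)\."),
]
# Example idea: referrers from a domain that real visitors never arrive from.
BLOCKED_REFERRER_PATTERNS = [
    re.compile(r"^https?://(?:[^/]+\.)?referrer-farm\.example/", re.IGNORECASE),
]


def should_block(user_agent: str, referrer: str) -> bool:
    """Return True if either header matches one of the block patterns."""
    if any(p.search(user_agent) for p in BLOCKED_UA_PATTERNS):
        return True
    return any(p.search(referrer) for p in BLOCKED_REFERRER_PATTERNS)
```

The catch is the one already mentioned: a real visitor on an old browser matches the same pattern as the bot, so some legit users get blocked too.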
The attack was annoying, but the even bigger problem is that the data on this website is under license - we have to pay for it, and it's not cheap. We are able to pay for it (barely) with advertising revenue and some subscriptions.
If everyone gets this data from their "agent" and scrapers, that means no advertising revenue, and soon enough no more website: jobs lost, nowhere for scrapers to scrape the data, nowhere for legit users to get the data for free, etc.
Thanks for sharing the perspective here. I think a lot of folks on HN have rightly said that a lot of the problems with the modern internet are due to the ad-supported business model. I don't think you were ever going to move away from it voluntarily -- too many people support it, even if they grumble about it.
But maybe (and likely for worse) LLMs will finally kill this model.
I would love for the ad-supported model to die. I hate ads, and I hate having to serve ads. We get some subscription users but nowhere near enough to cover costs.
Unfortunately, what I think will happen - and indeed already is happening - is that the AI companies themselves will replace much of the WWW. Sites like the one I am talking about will cease to exist. AI companies, once they can no longer scrape (steal) the data, will end up licensing the data themselves and replace us as the distributor to end users. Perhaps as a subscription add-on, or also with an ad-based model.
Which to some may be fine. Personally, I don't want a few centralized AI companies replacing the hundreds of thousands of independent websites online. Way too much centralized power there.
If you don't mind me asking, what sort of data are you licensing? I noticed that you explicitly don't mention it.
Do you not run Anubis or have strict fail2ban rules? I just straight up ban IPs forever if they look up files that will never exist on my servers. That plus Anubis with the strictest settings.
https://anubis.techaro.lol/
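The "ban forever on a lookup that can never be legit" rule is simple enough to sketch. This is not a fail2ban config, just the idea in plain Python; the trap paths and the `TrapBanList` class are assumptions made up for the example.

```python
# Hypothetical trap paths: files that do not exist on this (imaginary) server,
# so any request for them is assumed to come from a scanner.
TRAP_PATHS = {"/wp-login.php", "/.env", "/phpmyadmin/"}


class TrapBanList:
    """Permanently bans any IP that requests a trap path."""

    def __init__(self, trap_paths):
        self.trap_paths = frozenset(trap_paths)
        self.banned = set()

    def allow(self, ip: str, path: str) -> bool:
        """Return True if the request should be served."""
        if ip in self.banned:
            return False
        if path in self.trap_paths:
            # First offence is enough: the ban never expires.
            self.banned.add(ip)
            return False
        return True
```

Note that this only catches clients that request something obviously bogus. A client that sticks to real URLs never touches a trap path.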
Fail2ban doesn't scale well to these volumes of traffic and request patterns.
Likewise, fail2ban is not very useful against a DDoS attack where each unique IP makes only a few requests with a large (hour+) delay in between requests. There is no clear "fail" in these requests, and the fail2ban database becomes huge and far too slow.
- 400,000 Unique IP addresses
- 1 to 3 requests per hour per IP address, with delays of over 60 minutes between requests.
- Legit request URLs, legit UA & referrer
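To put rough numbers on the list above, here is a back-of-the-envelope sketch. The IP count and per-IP rates are the ones quoted; the `MAX_RETRY` and `FIND_TIME_SECONDS` values are hypothetical stand-ins for a fail2ban-style `maxretry`/`findtime` threshold, and the evenly spaced requests are a simplifying assumption.

```python
UNIQUE_IPS = 400_000

# Hypothetical per-IP threshold: ban after 5 hits within 10 minutes.
MAX_RETRY = 5
FIND_TIME_SECONDS = 600


def aggregate_rps(unique_ips: int, per_ip_per_hour: int) -> float:
    """Total requests per second the server sees across all IPs."""
    return unique_ips * per_ip_per_hour / 3600


def max_requests_in_window(per_ip_per_hour: int, window_seconds: int) -> int:
    """Most requests one IP can land in a window, assuming even spacing."""
    interval = 3600 / per_ip_per_hour
    return int(window_seconds // interval) + 1


for rate in (1, 3):
    total = aggregate_rps(UNIQUE_IPS, rate)
    per_ip = max_requests_in_window(rate, FIND_TIME_SECONDS)
    print(f"{rate}/hour/IP: ~{total:.0f} req/s total, "
          f"at most {per_ip} hit(s) per IP per window, "
          f"banned: {per_ip >= MAX_RETRY}")
```

Under these assumptions the server sees somewhere between roughly 111 and 333 requests per second in aggregate, while no single IP ever gets close to a per-IP threshold.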
Maybe Anubis would help, but it's also a risk for various reasons.
At some point there needs to be a check if it's a real human... But it's a cat and mouse game - any way we create to keep bots off gets a workaround from clever engineers.
Don’t worry, man, once AGI is here you’ll get your allowance (or whatever the hyperscalers’ plan is).
You’ll enjoy painting or some other art even if you aren’t interested in the arts. That’s what I’ve seen written about it.