> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google has, from the beginning, done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things actually change. And as an additional safeguard, when they notice your responses slowing down, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. My robots.txt has an explicit ban on ClaudeBot and a Crawl-Delay for everyone else. It ignores both. I sent them an email about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'd add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your public documentation?)
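For context, the relevant part of my robots.txt boils down to something like this (the delay value here is illustrative, not my exact setting):

    # Anthropic's crawler: banned outright
    User-agent: ClaudeBot
    Disallow: /

    # Everyone else: throttled
    User-agent: *
    Crawl-delay: 10

Nothing exotic; any crawler that claims to honor robots.txt and Crawl-Delay should respect both lines.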
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
And still, ClaudeBot is one of the better-behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server and caused user complaints). Turns out there are parties that crawl using millions of (residential) IPs and identify themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that didn't catch too many real users. I used it to start firewalling IPs on sight, and their own randomization meant that eventually every IP fell into the trap. But it took days.
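For the curious, the ban loop itself was nothing clever; stripped down it was essentially this (the regex is a placeholder, I'm not publishing the real tell, and the nftables names are just examples):

    # Follow the access log, match the identifying quirk, and drop every IP
    # that exhibits it. Assumes the combined log format (client IP is the
    # first field) and an nftables set that an input-chain rule drops from, e.g.:
    #   nft add set inet filter crawlers '{ type ipv4_addr; }'
    #   nft add rule inet filter input ip saddr @crawlers drop
    import re
    import subprocess

    LOG = "/var/log/nginx/access.log"
    TELL = re.compile(r"PLACEHOLDER_QUIRK")   # stand-in for the real pattern
    IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\b")

    banned = set()

    def ban(ip: str) -> None:
        if ip in banned:
            return
        banned.add(ip)
        subprocess.run(["nft", "add", "element", "inet", "filter",
                        "crawlers", f"{{ {ip} }}"], check=False)

    with subprocess.Popen(["tail", "-F", LOG],
                          stdout=subprocess.PIPE, text=True) as tail:
        for line in tail.stdout:
            if TELL.search(line):
                m = IP_RE.match(line)
                if m:
                    ban(m.group(1))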
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
I am still a bit confused by what some of these crawlers are getting out of it; repeatedly crawling sites that haven't changed seems to be the norm for the current crawlernets, which is a massive waste of resources on their end for what is, on average, data of rather indifferent quality.
Nothing. They're not designed to be useful. They're designed to grab as much data as possible and they'll figure out what to do with it later - they don't know it's mostly useless yet.
Tarpits are cool.
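The whole trick fits in a few lines: hold the connection open and dribble out junk so the crawler ties up its own resources instead of yours. A toy sketch (blocking, one victim at a time; a real one would be async):

    # Toy tarpit: accept a connection, then feed it worthless bytes forever,
    # a few seconds apart, so an impolite crawler wastes its time here.
    import socket
    import time

    def tarpit(host="0.0.0.0", port=8081, drip_interval=5.0):
        srv = socket.create_server((host, port))
        while True:
            conn, _addr = srv.accept()
            try:
                conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
                while True:
                    conn.sendall(b"<a href='/'>.</a>")  # endless, useless "content"
                    time.sleep(drip_interval)
            except OSError:
                pass  # client gave up
            finally:
                conn.close()

    if __name__ == "__main__":
        tarpit()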
Did you send any abuse reports to the ASNs for those IP addresses?
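(If you ever want to, mapping the offending IPs to ASNs can be done in bulk against Team Cymru's whois service; a rough sketch, assuming their documented verbose output format:)

    # Bulk IP -> ASN lookup via Team Cymru's public whois service.
    # Verbose output lines are pipe-separated:
    #   AS | IP | BGP Prefix | CC | Registry | Allocated | AS Name
    import socket

    def asn_lookup(ips):
        query = "begin\nverbose\n" + "\n".join(ips) + "\nend\n"
        with socket.create_connection(("whois.cymru.com", 43), timeout=10) as s:
            s.sendall(query.encode())
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        results = {}
        for line in b"".join(chunks).decode(errors="replace").splitlines():
            fields = [f.strip() for f in line.split("|")]
            if len(fields) >= 7:                 # skips the header line
                results[fields[1]] = (fields[0], fields[6])
        return results

    # e.g. asn_lookup(["192.0.2.1", "198.51.100.7"])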