I had to block Meta's ASN on my personal cgit server a few weeks ago because they were ignoring robots.txt and torching it. Hundreds of megabytes of access logs just from them, spread across different network blocks in a clear attempt to defeat IP-based limiting. I couldn't believe it.
I had to last year too: nonstop crawling, random URLs that didn't exist. It looked like they were trying to proxy user queries through to a search endpoint too. The ASN matched so I know it wasn't someone spoofing them.
IMO ASN-based blocking should be much more common, but unfortunately it is not supported as a first-class configuration option in many common tools.
Yeah, I don't know how anybody stays sane without it. I have a list of over a thousand ASNs I blackhole at this point...
Mine is a daily bash cronjob that fetches a text-based database and uses grep to build an nftables-apply script with all the IPs for the blocked ASNs. I keep meaning to share it, but it's embarrassingly messy and I haven't had time to clean it up...
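For anyone wanting a starting point in the meantime, here is a minimal sketch of the same idea. It is not the script described above: it is written in Python rather than bash and grep, and it assumes the text database is a tab-separated file with the columns range_start, range_end, AS_number, country_code, AS_description. The table name `asnblock`, the set name `blocked4`, and the function names are all made up for illustration, and the generated ruleset has not been tested against a live firewall.

```python
import ipaddress


def blocked_cidrs(tsv_lines, blocked_asns):
    """Return collapsed IPv4 networks for every row whose ASN is blocked.

    Assumed row format (tab-separated):
    range_start, range_end, AS_number, country_code, AS_description
    """
    nets = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip headers, blank lines, and malformed rows.
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        if int(fields[2]) not in blocked_asns:
            continue
        first = ipaddress.IPv4Address(fields[0])
        last = ipaddress.IPv4Address(fields[1])
        # A start/end range may need several CIDR prefixes to express.
        nets.extend(ipaddress.summarize_address_range(first, last))
    # Merge adjacent and overlapping prefixes to keep the set small.
    return list(ipaddress.collapse_addresses(nets))


def render_nft(nets, table="asnblock"):
    """Render an nftables script that drops traffic from the given networks.

    The first two lines create the table if missing and then delete it,
    so the script can be re-applied as a whole on every run.
    """
    lines = [
        f"table inet {table}",
        f"delete table inet {table}",
        f"table inet {table} {{",
        "  set blocked4 {",
        "    type ipv4_addr",
        "    flags interval",
    ]
    if nets:
        body = ",\n      ".join(str(n) for n in nets)
        lines.append(f"    elements = {{\n      {body}\n    }}")
    lines += [
        "  }",
        "  chain input {",
        "    type filter hook input priority 0; policy accept;",
        "    ip saddr @blocked4 drop",
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"
```

A cron job would read the downloaded database, call `blocked_cidrs` with the set of ASNs to block, write the output of `render_nft` to a file, and load it with `nft -f`. IPv6 would need a second set of type `ipv6_addr`, which is left out here.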
It's been a real game of cat and mouse over the last few years. I used to do daily iptables updates to block repeat scrapers on a small niche stats site I run. About 5-6 years ago it became more common to see broader ranges, so I started blocking ASNs, which worked great (esp. for the regulars like Alibaba, Tencent, compromised DigitalOcean/OVH, ...). In the last 2-3 years though the overall bot traffic has skyrocketed. It's easy to spot bot activity after the fact (no requests to the CDN for static assets, user agent changes from one request to the next, predictable ID enumeration, etc.) but not in real time. They're also often using residential proxies, and Cloudflare bot detection has become pretty bad.
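The three after-the-fact signals listed above can be sketched roughly like this. It assumes the logs are already parsed into (ip, path, user_agent) tuples in request order, that `asset_ips` is the set of client IPs seen in the CDN logs, and that record IDs appear as a numeric last path segment. The thresholds and all names are hypothetical and would need tuning for a real site.

```python
import re
from collections import defaultdict

# Hypothetical URL shape: a numeric ID as the last path segment, e.g. /player/123
ID_PATH = re.compile(r"/(\d+)/?$")


def longest_run(ids):
    """Length of the longest streak where each ID is the previous one plus 1."""
    best = run = 1 if ids else 0
    for prev, cur in zip(ids, ids[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best


def flag_suspects(requests, asset_ips, min_requests=5, min_run=4):
    """Map each suspicious IP to the list of signals it tripped.

    requests: iterable of (ip, path, user_agent) in request order.
    asset_ips: set of IPs that fetched static assets from the CDN.
    """
    by_ip = defaultdict(list)
    for ip, path, ua in requests:
        by_ip[ip].append((path, ua))

    report = {}
    for ip, hits in by_ip.items():
        if len(hits) < min_requests:
            continue  # too little traffic to judge
        reasons = []
        # Signal 1: pages were fetched but no static assets ever were.
        if ip not in asset_ips:
            reasons.append("no_static_assets")
        # Signal 2: the user agent differs between consecutive requests.
        if any(a != b for (_, a), (_, b) in zip(hits, hits[1:])):
            reasons.append("user_agent_changes")
        # Signal 3: numeric IDs are walked in order.
        ids = [int(m.group(1)) for path, _ in hits if (m := ID_PATH.search(path))]
        if longest_run(ids) >= min_run:
            reasons.append("id_enumeration")
        if reasons:
            report[ip] = reasons
    return report
```

This only works in hindsight because each signal needs several requests from the same client before it shows up, which is exactly the limitation described above. It also keys on IP, so traffic spread thinly across many residential proxies would slip under `min_requests`.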
Arms races suck. I've managed to find a few L7 tricks to catch the residential proxies and serve them an empty 200, but there are obvious trivial workarounds on the other end and if I start talking about them in public they won't last long... I wish I could share :/
Cloudflare is so easy to defeat, and almost everyone in the scraping industry is selling solutions that bypass it automatically. hCaptcha solving is also really cheap nowadays.
It would still be useful to share as an example and reference point. People can use Claude Code / etc. to re-write it to their specific situation.
It would break the internet to make this available to the average person. A large swath would actively choose to block stuff like: all of Meta, Alphabet, Apple, Amazon, etc etc etc.
Anyhoo, now you mention it this is the tack I am going to take in my own network, thanks!
Nah, they'd just pay botnet operators a few thousand bucks for proxy services.
It's a real pain in the ass because in the absence of ASN-based blocking, you often have to give something a long list of IP ranges in CIDR notation, and be certain you don't "miss" even one IPv4 /23 or /24 or a crawler will get through.
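One way to catch a missed range is to diff the block list against the prefixes you know belong to the network. A rough sketch, assuming you already have both lists as IPv4 CIDR strings (where the "announced" list comes from is up to you; the function name is made up):

```python
import ipaddress


def missed_prefixes(announced, blocklist):
    """Return the announced prefixes not fully covered by the block list.

    Both arguments are iterables of IPv4 CIDR strings. A prefix that is
    only partly covered is reported whole.
    """
    # Collapse first so that, for example, two adjacent /24s count as a /23.
    blocked = list(
        ipaddress.collapse_addresses(ipaddress.ip_network(c) for c in blocklist)
    )
    missed = []
    for cidr in announced:
        net = ipaddress.ip_network(cidr)
        if not any(net.subnet_of(b) for b in blocked):
            missed.append(cidr)
    return missed


announced = ["198.51.100.0/23", "192.0.2.0/24", "203.0.113.128/25"]
blocklist = ["198.51.100.0/24", "198.51.101.0/24", "203.0.113.0/24"]
print(missed_prefixes(announced, blocklist))  # ['192.0.2.0/24']
```

The linear scan is fine for a few thousand prefixes; a much larger list would want a sorted lookup instead.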
Hey, how do you identify them? Is there a service to recognize which of these companies scraped you?
Every few weeks I run my nginx access logs through a script that uses the same textual ASN database to tally them up and spit out a summary report. There are many different sources for periodic textual ASN databases you can parse with UNIXy tools.
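A tally like that can be sketched in a few lines. This is not the script mentioned above, just an illustration. It assumes the ASN database is tab-separated with the columns range_start, range_end, AS_number, country_code, AS_description, and that each log line starts with the client IP, as in nginx's default combined format. IPv6 clients and unmatched addresses are lumped under a made-up `(0, "unknown")` bucket.

```python
import bisect
import ipaddress
from collections import Counter


def load_ranges(tsv_lines):
    """Parse the ASN database into a sorted list of (start, end, asn, name)."""
    ranges = []
    for line in tsv_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 3 or not f[2].isdigit():
            continue  # header, blank, or malformed row
        start = int(ipaddress.IPv4Address(f[0]))
        end = int(ipaddress.IPv4Address(f[1]))
        name = f[4] if len(f) > 4 else ""
        ranges.append((start, end, int(f[2]), name))
    ranges.sort()
    return ranges


def lookup(ranges, starts, ip):
    """Return (asn, name) for an IPv4 string, or None if nothing matches."""
    try:
        n = int(ipaddress.IPv4Address(ip))
    except ipaddress.AddressValueError:
        return None  # not an IPv4 address (IPv6, empty line, garbage)
    # Find the last range whose start is <= n, then check its end.
    i = bisect.bisect_right(starts, n) - 1
    if i >= 0 and ranges[i][1] >= n:
        return ranges[i][2], ranges[i][3]
    return None


def tally(log_lines, ranges):
    """Count requests per (asn, name) from access log lines."""
    starts = [r[0] for r in ranges]
    counts = Counter()
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        counts[lookup(ranges, starts, ip) or (0, "unknown")] += 1
    return counts
```

Printing `tally(...).most_common(20)` gives the kind of summary report described: the networks sending the most requests, by name, which is usually enough to decide what goes on the block list.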
The world would be a much better place if these kinds of engineers had a spine.
Yeah they’d have to use it to stand at the back of the unemployment line. Companies don’t care, someone more desperate will take the job.
Are you one of those engineers building said crawlers, by any chance?
Some spines are just crooked, and the extra rigidity would hurt more than help.
"One moment: reticulating spines..."
They could even feed 20 kids