Yeah, that's where IP intelligence comes in. They're using pretty big IP pools, so either you're manually adding individual IPs to a list all day (and updating that list as ASNs get continuously shuffled around), or you've got a process in the background that essentially does whois lookups (and caches them, so you aren't being abusive to the registries yourself), parses the metadata returned, and decides whether that request is "okay" or not.
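A minimal sketch of what that background process might look like, assuming the `ipwhois` Python package for RDAP lookups; the blocklist keywords and cache size are just placeholders for whatever policy you'd actually use:

```python
# Cached RDAP lookup + naive allow/deny policy. Assumes `pip install ipwhois`.
from functools import lru_cache

from ipwhois import IPWhois

# Hypothetical org/ASN keywords you consider "probably a datacenter crawler".
BLOCKED_KEYWORDS = ("amazon", "digitalocean", "hetzner", "ovh", "alibaba")


@lru_cache(maxsize=100_000)
def lookup_asn(ip: str) -> dict:
    """RDAP lookup, cached so repeat hits from the same IP don't hammer the registries."""
    result = IPWhois(ip).lookup_rdap(depth=0)
    return {
        "asn": result.get("asn"),
        "description": (result.get("asn_description") or "").lower(),
        "cidr": result.get("asn_cidr"),
    }


def request_is_okay(ip: str) -> bool:
    """Deny if the ASN description matches a blocked keyword; otherwise allow."""
    try:
        info = lookup_asn(ip)
    except Exception:
        # Lookup failed; fail open rather than blocking real users.
        return True
    return not any(kw in info["description"] for kw in BLOCKED_KEYWORDS)
```

In practice you'd want to cache by CIDR rather than by individual IP and persist the cache somewhere, but the shape is the same.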

The classic 80/20 rule applies. You can catch about 80% of lazy crawler activity pretty easily with something like this, but the remaining 20% takes a lot more effort. You start hitting edge cases, like crawlers that run their scraping on AWS while, somewhere, one of your customers is also syncing their WooCommerce orders to their in-house ERP system via a process that runs on AWS.
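One way to handle the "my customer is also on AWS" problem is to treat datacenter traffic as suspicious by default but carve out an allowlist for known integrations. This sketch uses AWS's published ranges (ip-ranges.amazonaws.com) and the stdlib; the allowlist subnet and the "challenge" action are placeholders:

```python
# Classify a source IP against AWS's published ranges, with a customer allowlist.
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

# Subnets you've confirmed belong to legitimate customer integrations (example value).
CUSTOMER_ALLOWLIST = [ipaddress.ip_network("203.0.113.0/24")]


def load_aws_networks() -> list:
    """Fetch and parse the AWS IP ranges feed into network objects."""
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        data = json.load(resp)
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]


AWS_NETWORKS = load_aws_networks()


def classify(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in CUSTOMER_ALLOWLIST):
        return "allow"  # known customer integration, even though it's on AWS
    if any(addr in net for net in AWS_NETWORKS):
        return "challenge"  # datacenter traffic: rate limit, CAPTCHA, or block
    return "allow"
```

The annoying part isn't the code, it's maintaining that allowlist as customers change their own infrastructure.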

I've had crawlers get stuck in a loop before on a search page where you could basically keep requesting further pages even when there were no results left. I filtered out the requests that were definitely bots (requests for pages far past the point where any results exist). It came to over a million unique IPs, most of them only making 1 or 2 requests each, spread across many different IP blocks.
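That kind of filter is cheap to express. A rough sketch, assuming the search URL carries a `page` query parameter and you know the real maximum page for the query (both parameter name and threshold are made up here):

```python
# Flag requests asking for result pages far beyond anything that exists.
from urllib.parse import parse_qs, urlparse


def is_obvious_bot(request_path: str, max_real_page: int) -> bool:
    params = parse_qs(urlparse(request_path).query)
    try:
        page = int(params.get("page", ["1"])[0])
    except ValueError:
        return False
    # A real user paging a little past the end is plausible;
    # page 200 of an empty result set is not.
    return page > max_real_page + 5


# Example: results end at page 3, the crawler is asking for page 200.
print(is_obvious_bot("/search?q=widgets&page=200", max_real_page=3))  # True
```

With traffic that distributed (1–2 requests per IP), per-IP rate limits do almost nothing, so keying the filter on the request itself rather than the source is really the only thing that works.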