> Cloud services company Fastly agrees. It reports that 80% of all AI bot traffic comes from AI data fetcher bots.

No kidding. An increasing number of sites are putting up CAPTCHAs.

Problem? CAPTCHAs are annoying, they're a 50-times-a-day eye exam, and

> Google's reCAPTCHA is not only useless, it's also basically spyware [0]

> reCAPTCHA v3's checkbox test doesn't stop bots and tracks user data

[0] https://www.techspot.com/news/106717-google-recaptcha-not-on...

Webmasters are really kinda stuck between a rock and a hard place with this one.

At least with what I'm doing, poorly configured or outright malicious bots consume about 5000x the resources of human visitors, so having no bot mitigation means I've basically given up and decided I should try to make it as a vegetable farmer instead of doing stuff online.

Bot mitigation in practice is a tradeoff: put up enough of an obstacle to keep most of the bots out, while not annoying the users so much that they leave.

I think right now Anubis is one of the less bad options. Some users are annoyed by it (and it is annoying), but it's less annoying than clicking fire hydrants 35 times, and as long as you configure it right it seems to keep most of the bots out, or at least drives them to behave in a more identifiable manner.

Probably won't last forever, but I don't know what would, besides going full ancap and doing crypto microtransactions for each page request. That would unfortunately drive off not only the bots, but the human visitors as well.

Anubis is extremely slow on low-end devices; it often takes >30 seconds to complete. Users deserve better, but I guess it's still a better experience than reCAPTCHA or Cloudflare.

Well, >30 seconds to complete Anubis is still better than >30 seconds to complete every single page load because AI bots are overloading the servers.
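For context, the challenge Anubis serves is essentially a SHA-256 proof of work: the server hands the browser a nonce and the client JS brute-forces a counter whose hash clears a difficulty threshold, which is exactly the kind of busywork that crawls on weak hardware. A rough sketch of the idea in Python (not Anubis' actual protocol; the nonce format and difficulty here are made up):

```python
import hashlib
import os
import time

DIFFICULTY_BITS = 16  # invented difficulty; real deployments tune this

def issue_challenge() -> str:
    """Server side: hand the client a random nonce to grind on."""
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(nonce: str, counter: int) -> bool:
    """Server side: accept the answer if hash(nonce:counter) is hard enough."""
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

def solve(nonce: str) -> int:
    """What the in-browser JS does, in spirit: brute-force a counter."""
    counter = 0
    while not verify(nonce, counter):
        counter += 1
    return counter

if __name__ == "__main__":
    nonce = issue_challenge()
    start = time.time()
    print("counter:", solve(nonce), f"({time.time() - start:.2f}s)")
```

The cost is all in that brute-force loop, so a slow phone pays far more wall-clock time than a datacenter scraper with fast cores.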

I've just started clicking away from pages that are full of CAPTCHAs. Ironically this has resulted in me using AI more.

Ironic part... LLMs are very good at solving CAPTCHAs. So the only people bothered by those same CAPTCHAs are the actual site visitors.

What sites need to do is temporarily block repeat requests from the same IPs. Sure, some agents use tens of thousands of IPs, but if they are really as aggressive as people say, you're going to run into the same IPs way more often than normal users.
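A rough sketch of that: a sliding-window counter per IP with a temporary block once it trips (all thresholds below are invented, tune them against your own traffic):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60       # sliding window to count requests in (invented)
MAX_REQUESTS = 120        # ~2 r/s sustained before we react (invented)
BLOCK_SECONDS = 15 * 60   # temporary block, not a permanent ban

hits = defaultdict(deque)   # ip -> timestamps of recent requests
blocked_until = {}          # ip -> time the temp block expires

def allow(ip: str) -> bool:
    """Return False if this IP is currently temp-blocked or just earned a block."""
    now = time.time()

    if blocked_until.get(ip, 0.0) > now:
        return False

    window = hits[ip]
    window.append(now)
    # drop timestamps that fell out of the sliding window
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()

    if len(window) > MAX_REQUESTS:
        blocked_until[ip] = now + BLOCK_SECONDS
        window.clear()
        return False
    return True
```

In practice this usually lives in the reverse proxy or CDN (nginx's limit_req, fail2ban, and the like) rather than application code, but the logic is the same.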

That will kick out the overly aggressive guys. I have done web scraping and limited it to around 1 r/s. You never run into any blocking or detection that way because you hardly show up. But then you have some *** that send thousands of parallel requests at a website, because they never figured out how to batch their queries for large page counts, and don't know how to check which pages were updated since the last run.
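A throttle like that is a couple of lines; here's a sketch with the requests library (the URLs and User-Agent string are placeholders):

```python
import time
import requests  # third-party: pip install requests

MIN_INTERVAL = 1.0  # seconds between requests, i.e. ~1 r/s

session = requests.Session()
# identify yourself so an admin can contact you instead of just blocking you
session.headers["User-Agent"] = "example-scraper/0.1 (+mailto:ops@example.com)"

def crawl(urls):
    """Fetch urls one at a time, never faster than MIN_INTERVAL apart."""
    last = 0.0
    for url in urls:
        wait = MIN_INTERVAL - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield url, session.get(url, timeout=30)

if __name__ == "__main__":
    for url, resp in crawl(["https://example.com/a", "https://example.com/b"]):
        print(url, resp.status_code, len(resp.content))
```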

One of the main issues I see is that some people simply write the most basic of basic scrapers: see link, follow, spawn process, scrape, see 100 more links... Updates? Just rescrape the whole website, repeat, repeat... Because it takes time to make a scrape template for each website that knows where to check for updates, some never bother.
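For the update problem, HTTP already gives you most of a "template" for free: send back the ETag / Last-Modified values from the previous crawl and a well-behaved server answers 304 with no body when nothing changed. A sketch, again assuming the requests library, with a plain dict standing in for a real cache:

```python
import requests  # third-party: pip install requests

session = requests.Session()
validators = {}  # url -> {"etag": ..., "last_modified": ...} from the previous fetch

def fetch_if_changed(url: str):
    """Re-fetch a page only if the server says it changed since last time."""
    headers = {}
    cached = validators.get(url, {})
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = session.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged: no body transferred, nothing to re-parse

    # remember the new validators for the next run
    validators[url] = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }
    return resp.text
```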

I often use a VPN or iCloud Private Relay. Some sites gripe "too many accesses (downloads) from your IP address today."

The devil’s in the details. I (a non-bot) sometimes resort to VPN-flipping.

I suppose that some bots try this, just a wild guess.

And because companies like Fastly only measure things via JavaScript execution and assume everything that doesn't execute JS correctly is a bot, that 80% includes a whole bunch of actual humans.

The Fastly report[1] has a couple of great quotes that mention Common Crawl's CCBot:

> Our observations also highlight the vital role of open data initiatives like Common Crawl. Unlike commercial crawlers, Common Crawl makes its data freely available to the public, helping create a more inclusive ecosystem for AI research and development. With coverage across 63% of the unique websites crawled by AI bots, substantially higher than most commercial alternatives, it plays a pivotal role in democratizing access to large-scale web data. This open-access model empowers a broader community of researchers and developers to train and improve AI models, fostering more diverse and widespread innovation in the field.

...

> What’s notable is that the top four crawlers (Meta, Google, OpenAI and Claude) seem to prefer Commerce websites. Common Crawl’s CCBot, whose open data set is widely used, has a balanced preference for Commerce, Media & Entertainment and High Tech sectors. Its commercial equivalents Timpibot and Diffbot seem to have a high preference for Media & Entertainment, perhaps to complement what’s available through Common Crawl.

There's also one final number that isn't in the Fastly report but is in the El Reg article[2]:

> The Common Crawl Project, which slurps websites to include in a free public dataset designed to prevent duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly-low 0.21 percent.

1: https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat...

2: https://www.theregister.com/2025/08/21/ai_crawler_traffic/