AI crawlers have led to a huge surge in scraping activity, and most of these bots don't respect any of the scraping best practices the industry has developed over the past two decades (robots.txt, rate limits, identifying user agents, etc.).
This comes with negative side effects for website owners (costs, downtime, etc.), as has been repeatedly reported here on HN and as I've experienced myself.
Does Webhound respect robots.txt directives, and do you disclose the identity of your crawlers via the User-Agent header?
We currently use Firecrawl for our crawling infrastructure. Their documentation says robots.txt is respected, but user reports in their GitHub issues suggest the implementation is inconsistent, particularly for one-off scrapes vs. full crawls.
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
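To make that concrete, here's a rough sketch of the behavior we're aiming for, using only Python's standard library. The bot name ("ExampleBot") and contact URL are placeholders rather than a real crawler identity, and this isn't Firecrawl's code or ours; it just illustrates the two things you asked about: an identifiable User-Agent and a robots.txt check before each fetch.

    # Sketch only: declare an identifiable User-Agent and honor robots.txt.
    # "ExampleBot/1.0" and the contact URL below are placeholders.
    from urllib import robotparser, request
    from urllib.parse import urljoin, urlparse

    USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

    def polite_fetch(url: str, timeout: float = 10.0) -> bytes | None:
        """Fetch url only if the site's robots.txt allows our user agent."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
        rp.read()  # download and parse the site's robots.txt
        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site owner has opted out; respect that
        req = request.Request(url, headers={"User-Agent": USER_AGENT})
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.read()

The important parts are a stable User-Agent token that site owners can target in their robots.txt and checking can_fetch() before every request (plus rate limiting, which is omitted here).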
Appreciate you bringing this up.
Firecrawl is egregiously expensive.