We currently use Firecrawl for our crawling infrastructure. Their documentation says they respect robots.txt, but based on user reports in their GitHub issues the implementation seems inconsistent, particularly for one-off scrapes versus full crawls.
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
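For context, this is roughly the kind of check we'd want in place regardless of which crawler we end up using: consult robots.txt with an identifiable user agent before fetching anything. The sketch below is just an illustration in Python using the standard library's `urllib.robotparser` plus `requests`; the user-agent string, `fetch_if_allowed`, and the URLs are placeholders, not anything Firecrawl actually does.

```python
# Minimal sketch: check robots.txt with an identifiable user agent
# before fetching a page. The UA string and URLs are hypothetical.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests

USER_AGENT = "OurCrawlerBot/0.1 (+https://example.com/bot-info)"  # placeholder UA

def fetch_if_allowed(url: str) -> str | None:
    """Fetch `url` only if the site's robots.txt permits our user agent."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"

    robots = RobotFileParser()
    robots.set_url(urljoin(root, "/robots.txt"))
    robots.read()  # downloads and parses robots.txt

    if not robots.can_fetch(USER_AGENT, url):
        return None  # site owner has opted out for this path

    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    html = fetch_if_allowed("https://example.com/some/page")
    print("skipped (disallowed)" if html is None else f"fetched {len(html)} chars")
```

The point of keying everything on a stable, documented user-agent string is that it gives site owners a concrete target for opting out via robots.txt or blocking.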
Appreciate you bringing this up.
Firecrawl is egregiously expensive