Does mass scraping need Google for content discovery? Surely most sites have a sitemap or index that lets a scraper enumerate their pages once it knows the domain, which is publicly disclosed more often than not?
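To illustrate the self-enumeration point, here is a minimal sketch of pulling page URLs out of a sitemap. It assumes the site follows the sitemaps.org XML format; the function name `extract_sitemap_urls` and the `SAMPLE` document are made up for the example, and a real site may not publish a sitemap at all.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return every <loc> value in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined so the sketch needs no network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

The same function also handles a sitemap index, since those list child sitemaps in `<loc>` elements too; a crawler would then fetch each of those and repeat.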
What matters is whether websites put this new version of reCAPTCHA on their own pages, as archive.is has done. Once they do, scrapers will have a hard time getting around it.