I wonder how many CPU cycles are spent serving AI companies' content scrapers. This factor isn't usually considered in estimates of the “environmental impact of AI.” What's the overhead of all that crawling on top of inference and training?
To be fair, an accurate measurement would also need to account for the CPU cycles the human user driving the bot would have spent anyway. From that perspective, maybe the scrapers can “make up for it” by crawling efficiently, i.e. avoiding tracker scripts, images, etc. unless they're actually needed to answer the query. They'd still burn CPU cycles, but fewer than a human user with a headful browser instance.
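
For a concrete sense of what “crawling efficiently” could look like, here's a minimal sketch in Python (assuming requests and BeautifulSoup; the bot name and URL are made up, not anything from a real crawler): a single GET for the HTML plus a parse, versus a headful browser that would also fetch every image, stylesheet, and tracker script on the page.

```python
# A lean crawl: fetch only the HTML document and keep the text and links,
# never requesting the subresources a full browser would pull in.
import requests
from bs4 import BeautifulSoup

def lean_fetch(url: str) -> dict:
    # One HTTP request for the document itself; no images, scripts,
    # stylesheets, or tracker endpoints are ever loaded.
    resp = requests.get(
        url,
        headers={"User-Agent": "example-crawler/0.1 (+https://example.com/bot)"},
        timeout=10,
    )
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")

    # Keep only what the query actually needs: visible text and outgoing links.
    return {
        "text": soup.get_text(separator=" ", strip=True),
        "links": [a["href"] for a in soup.find_all("a", href=True)],
    }

if __name__ == "__main__":
    page = lean_fetch("https://example.com/")
    print(len(page["text"]), "characters of text,", len(page["links"]), "links")
```

Of course this only helps on server-rendered pages; anything that needs client-side JavaScript pushes the crawler back toward a full browser, and most of the savings disappear.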