Why don't sites just start publishing a dump of their site that crawlers could pull instead? I realize that won't work for dynamic content, but surely a lot of the "small" sites that are currently getting hammered aren't purely dynamic content?
Maybe we could just publish a dump, in a standard format (WARC?), at a well-known address, and have the crawlers check there? The dump could be refreshed regularly, with an ETag or similar so that crawlers know when it's been updated.
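On the crawler side, the check could be as simple as a conditional GET. A rough sketch, assuming a hypothetical well-known path like /.well-known/site-dump.warc.gz (not an actual standard, just a placeholder):

    # Sketch only: the well-known path is made up, not a real convention.
    import urllib.request
    import urllib.error

    DUMP_PATH = "/.well-known/site-dump.warc.gz"  # hypothetical location

    def fetch_dump_if_changed(site, cached_etag=None):
        """Return (dump_bytes, etag), or (None, cached_etag) if unchanged."""
        req = urllib.request.Request(site.rstrip("/") + DUMP_PATH)
        if cached_etag:
            # Ask the server to skip the body if nothing changed since last crawl.
            req.add_header("If-None-Match", cached_etag)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("ETag")
        except urllib.error.HTTPError as err:
            if err.code == 304:  # Not Modified: keep using the previous download.
                return None, cached_etag
            raise

One dump download plus a cheap 304 check every few hours is a fraction of the cost of having every crawler re-walk the whole site.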
I suspect that even some dynamic sites could essentially snapshot themselves periodically, maybe once every few hours, and put the result up for download to satiate these crawlers while keeping the bulk of their serving capacity for actual humans.
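Generating that snapshot doesn't need anything exotic either. A rough sketch of the site-side half using the warcio and requests libraries; the URL list and output filename are placeholders, and a real site would pull the page list from its own sitemap or database:

    # Sketch: write a handful of pages into a WARC dump for crawlers to fetch.
    import requests
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    PAGES = ["https://example.com/", "https://example.com/about"]  # placeholders

    with open("site-dump.warc.gz", "wb") as output:
        writer = WARCWriter(output, gzip=True)
        for url in PAGES:
            resp = requests.get(url, stream=True)
            # Preserve the response headers in the record (status line simplified;
            # a careful version would copy resp.status_code as well).
            headers = StatusAndHeaders("200 OK", resp.raw.headers.items(),
                                       protocol="HTTP/1.0")
            record = writer.create_warc_record(url, "response",
                                               payload=resp.raw,
                                               http_headers=headers)
            writer.write_record(record)

Run that from a cron job every few hours and the "dump" stays reasonably fresh without the origin serving every page to every bot.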
Because crawlers aren't concerned about the bandwidth of the sites they crawl and will simply continue to take everything, everywhere, all the time regardless of what sites do.
Also it's unfair to expect every small site to put in the time and effort to, in essence, pay the Danegeld to AI companies just for the privilege of their continued existence. It shouldn't be the case that the web only exists to feed AI, or that everyone must design their sites around feeding AI.