This is very hacker-like thinking, using tech's biases against it!
I can't help but feel like we're all doing it wrong against scraping. Cloudflare is not the answer; in fact, I think they lost their geek cred when they added their "verify you are human" challenge screen and became the new gatekeeper of the internet. That must remain a permanent stain on their reputation until they make amends.
Are there any open source tools we could install that detect a high volume of requests and send those IP addresses to a common pool somewhere? That way individuals wouldn't get tracked, but bots would. Then we could query the pool for the current request's IP address and throttle it down based on volume (not block it completely), possibly at the server level with nginx or at whatever edge caching layer we use.
I know there may be scaling and privacy issues with this. Maybe it could use hashing or zero-knowledge proofs somehow? I realize this is hopelessly naive, and no, I haven't looked up whether someone has done this. I just feel like there must be a bulletproof solution to this problem, with a very simple explanation of how it works, or else we've missed something fundamental. Why all the hand-waving?
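Roughly what I'm imagining, as a toy sketch: the pool service, the salting scheme, and the thresholds below are all hypothetical, not an existing tool.

```python
# Toy sketch of the shared-pool throttling idea. Everything here is
# hypothetical: the pool, the salt handling, and the thresholds are
# placeholders for illustration, not an existing project.
import hashlib
import time
from collections import defaultdict


class SharedAbusePool:
    """Stands in for the 'common pool somewhere': it stores only salted
    hashes of IPs plus a request count, so individual addresses aren't
    directly tracked."""

    def __init__(self, salt: str):
        self.salt = salt
        self.counts: dict[str, int] = defaultdict(int)

    def _hash(self, ip: str) -> str:
        return hashlib.sha256((self.salt + ip).encode()).hexdigest()

    def report(self, ip: str, n: int = 1) -> None:
        self.counts[self._hash(ip)] += n

    def volume(self, ip: str) -> int:
        return self.counts[self._hash(ip)]


def throttle_delay(pool: SharedAbusePool, ip: str,
                   free_requests: int = 100, step_seconds: float = 0.05,
                   max_delay: float = 5.0) -> float:
    """Throttle rather than block: the heavier the reported volume,
    the longer the artificial delay before serving the response."""
    excess = max(0, pool.volume(ip) - free_requests)
    return min(max_delay, excess * step_seconds)


if __name__ == "__main__":
    pool = SharedAbusePool(salt="rotate-me-daily")
    pool.report("203.0.113.7", n=500)            # a noisy client gets reported
    delay = throttle_delay(pool, "203.0.113.7")
    print(f"delaying response by {delay:.2f}s")
    time.sleep(delay)
```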
Your approach to GenAI scrapers is similar to our fight against email spam. Email spam got solved because the industry was interested in solving it. But this issue has the industry split: without scraping, GenAI tools are less functional. And there is serious money involved, so they will use whatever means necessary, technical and legal, to fight such initiatives.
I've been exploring decentralized trust algorithms lately, so reading this was nice. I have a similar intuition: for every advance in scraping detection, scrapers will adapt too, so it's an ongoing war of mutations with no real victor.
The internet has seen some success with social media content moderation, so it seems natural enough that something similar could exist for web traffic itself: hosts able to "downvote" malicious traffic, plus some sort of decay mechanism to account for IP recycling. This exists in a basic sense today with known Tor exit nodes, published AWS and GCP IP ranges, etc.
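To make the decay part concrete, a rough sketch; every name and parameter here is an arbitrary placeholder, not a real system.

```python
# Rough sketch of the "downvote with decay" idea. Hypothetical: the
# half-life and weights are arbitrary placeholders. Downvotes against an
# IP fade over time, so a recycled address isn't punished forever.
import math
import time


class DecayingReputation:
    def __init__(self, half_life_hours: float = 24.0):
        self.half_life = half_life_hours * 3600
        # ip -> (score at last update, timestamp of last update)
        self.scores: dict[str, tuple[float, float]] = {}

    def _decayed(self, ip: str, now: float) -> float:
        score, last = self.scores.get(ip, (0.0, now))
        # Exponential decay: the score halves every half_life seconds.
        return score * math.exp(-math.log(2) * (now - last) / self.half_life)

    def downvote(self, ip: str, weight: float = 1.0) -> None:
        now = time.time()
        self.scores[ip] = (self._decayed(ip, now) + weight, now)

    def score(self, ip: str) -> float:
        return self._decayed(ip, time.time())


if __name__ == "__main__":
    rep = DecayingReputation(half_life_hours=24)
    for _ in range(10):
        rep.downvote("198.51.100.9")    # several hosts flag the same address
    print(round(rep.score("198.51.100.9"), 2))  # ~10 now, ~5 a day later
```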
That said, we probably don't have the right building blocks yet: IPs are too ephemeral, yet anything more identity-bound is a little too authoritarian IMO. Also, querying something on every request is probably too heavy.
Fun to think about, though.
Scrapers use residential IP proxies, so blocking based on IP addresses is not a solution.
Maybe some proof-of-work scheme to load page content, with difficulty increasing based on IP address behavior profiling.
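Something hashcash-like, for example. A toy sketch, where the difficulty schedule and the behavior score are placeholders:

```python
# Sketch of an adaptive proof-of-work gate. Hypothetical: the difficulty
# schedule and the behavior score are placeholders. A well-behaved IP gets
# a trivial puzzle; a profiled heavy scraper burns noticeably more CPU per page.
import hashlib
import os


def difficulty_for(behavior_score: float) -> int:
    """Map an IP's behavior score (0 = benign, 1 = clearly abusive) to the
    number of leading zero bits required in the client's hash."""
    return 8 + int(behavior_score * 16)   # 8..24 bits


def make_challenge() -> bytes:
    return os.urandom(16)


def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0


def solve(challenge: bytes, bits: int) -> int:
    """What the client-side JS would do: brute-force a nonce."""
    nonce = 0
    while not verify(challenge, nonce, bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = make_challenge()
    bits = difficulty_for(behavior_score=0.5)   # a moderately suspicious IP
    nonce = solve(challenge, bits)
    assert verify(challenge, nonce, bits)
    print(f"solved {bits}-bit puzzle with nonce {nonce}")
```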
Firewall.