Well, if you have a better way to solve this that’s open I’m all ears. But what Cloudflare is doing is solving the real problem of AI bots. We’ve tried to solve this problem with IP blocking and user agents, but they do not work. And this is actually how other similar problems have been solved. Certificate authorities aren’t open and yet they work just fine. Attestation providers are also not open and they work just fine.
> Well, if you have a better way to solve this that’s open I’m all ears.
Regulation.
Make it illegal to request the content of a webpage by crawler unless the website operator explicitly allows it via robots.txt. Institute a government agency tasked with enforcement. If you as a website operator can show that traffic came from bots, you can file a complaint with the agency and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using which IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.
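To make the opt-in idea concrete: under such a regime, a compliant crawler would have to find an explicit allow rule before fetching anything, defaulting to "do not crawl". A minimal sketch using only the Python standard library (the bot names and URLs are made-up examples):

```python
# Sketch of an opt-in robots.txt check a compliant crawler could run
# before fetching a page. Bot names and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if robots.txt permits this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

robots = """
User-agent: ExampleBot
Allow: /public/

User-agent: *
Disallow: /
"""

# The explicitly-allowed bot may fetch /public/, everyone else is denied.
print(may_crawl(robots, "ExampleBot", "https://example.com/public/page"))  # True
print(may_crawl(robots, "OtherBot", "https://example.com/public/page"))    # False
```

The enforcement question is then about crawlers that skip this check entirely, which is where fines and IP bookkeeping would come in.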
The biggest issue right now seems to be people renting their residential IP addresses to scraper companies, who then distribute large scrapes across these mostly distinct IPs. These addresses are from all over the world, not just your own country, so we'll either need a World Government, or at least massive intergovernmental cooperation, for regulation to help.
I don't think we need a world government to make progress on that point.
The companies buying these services are buying them from other companies. Countries or larger blocs like the EU can exert significant pressure on such companies by declaring the use of such services illegal when interacting with websites hosted in the country or bloc, or by companies based in them.
It just seems too easy to skirt around via middlemen. The EU (say) could prosecute an EU company directly doing this residential scraping, and it could probably keep tabs on a handful of bank accounts of known bad actors in other countries, and then investigate and prosecute EU companies transferring money to them. But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping? And then there's all the crypto channels and other quid pro quo payment possibilities.
Genuinely this isn't a tech specific or even novel problem. There is plenty of prior art when it comes to inhibiting unwanted behavior.
> But how do you stop an EU company paying a Moldovan company (that has existed for 10 days) for "internet services", that pays a Brazilian company, that pays a Russian company to do the actual residential scraping?
The same example could be made with money laundering, and yes it's a real and sizable issue. Yet, the majority of money is not laundered. How does the EU company make sure it will not be held liable, especially the people that made the decision? Maybe on a technical level the perfect crime is possible and not getting caught is possible or even likely given a certain approach. But the uncertainty around it will dissuade many, not all. The same goes for companies selling the services, you might think you have a foolproof way to circumvent the measures put in play, but what if not and the government comes knocking?
Your money laundering analogy is apt. I know very little about that topic, and I especially don't know how much money laundering is really out there (nor do governments), but I'm confident that a lot is. Do AML laws have a chilling effect on it? I think they must, since they surely increase the cost and risk, and similar legislation for scraping should have a similar effect. But AML is a pretty bad solution to money laundering, and I despair if AML-for-scraping is the best possible solution to scraping.
I'm not anti-government, but a technical solution that eliminates the problem is infinitely better than regulating around it.
The internet is too big and distributed to regulate. Nobody will agree on what the rules should be, and certain groups or countries will disagree in any case and refuse to enforce them.
Existing regulation rarely works, and enforcement is half-assed at best. Ransomware is regulated and illegal, but we see articles about major companies getting infected all the time.
I don't think registering with Cloudflare is the answer, but regulation definitely isn't the answer.
The problem is that a technical solution is impossible.
> Institute a government agency that is tasked with enforcement.
You're forgetting about the first W in WWW...
So what you're saying is that if I were to host a BitTorrent tracker in Sweden then the US can't do anything about it?
[flagged]
Agreed. It might not be THE BEST solution, but it is a solution that appears to work well.
Centralization bad, yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate... ipso facto columbo oreo... standard.
yep, that's why I am writing this now :)
You can see it in the web vs mobile apps.
Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spawn a web server versus publishing an app in the App Store (Apple).
Are they? Until Let's Encrypt came along and democratised the CA scene, it was a hell hole. Web security depended on how deep your pockets were. One could argue the same path is being laid in front of us until a Let's Encrypt comes along and democratises it. And since this is about attestation: how are we going to prevent gatekeepers from doing selective attestation with arguable criteria? How will we prevent political forces?
Certificate authorities don't block humans if they 'look' like a bot
AI poisoning is a better protection. Cloudflare is capable of serving stashes of bad data to AI bots as protective barrier to their clients.
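The mechanism being described is simple in principle: classify the request, then swap in decoy content for suspected scrapers. A hedged sketch of that idea (the user-agent list and pages are made-up illustrations, not Cloudflare's actual implementation):

```python
# Sketch: serve decoy text instead of real content when a request
# looks like an AI crawler. The agent list and page bodies are
# hypothetical examples, not any vendor's real method.
SUSPECT_AGENTS = ("GPTBot", "CCBot", "Bytespider")  # illustrative list

REAL_PAGE = "<p>The actual article text.</p>"
DECOY_PAGE = "<p>Plausible-looking but generated nonsense.</p>"

def respond(user_agent: str) -> str:
    """Return the decoy page for suspected scrapers, the real page otherwise."""
    if any(bot in user_agent for bot in SUSPECT_AGENTS):
        return DECOY_PAGE
    return REAL_PAGE

print(respond("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # decoy page
print(respond("Mozilla/5.0 (Windows NT 10.0)"))         # real page
```

In practice user agents are trivially spoofed, so real systems rely on behavioral and network signals rather than a string match like this.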
AI poisoning is going to get a lot of people killed, because the AI won't stop being used.
The current state of the art in AI poisoning is Nightshade[0] from the University of Chicago. It's meant to eventually be an add-on to their WebGlaze[1], which is an invite-only tool meant for artists to protect their art from AI mimicry.
Nobody is dying because artists are protecting their art
[0] https://nightshade.cs.uchicago.edu/whatis.html
[1] https://glaze.cs.uchicago.edu/webglaze.html
By that logic AI is already killing people. We can't presume that whatever can be found on the internet is reliable data, can we?
If science taught us anything, it's that no data is ever reliable. We are pretty sure about many things, and it's the best available info so we might as well use it, but as for "the internet can be wrong": any source can be wrong! I wouldn't even be surprised if the internet in aggregate (with the bot reading all of it) is right more often than the individual authors of pretty much anything.
Yet we use it every day for police, military, and political targeting with economic and kinetic consequences.
You mean incompetent users of AI will get people killed. You don't get a free pass because you used a tool that sucked.
This is some next level blame shifting. Next you are going to steal motor oil and then complain that your customers got sick when you used it to cook their food.
Okay, let them
You don't think the AI companies will make efforts to detect and filter out bad training data? Do you suppose they are already doing this, knowing that data quality has an impact on model capabilities?
The current state of the art in AI poisoning is Nightshade[0] from the University of Chicago. It's meant to eventually be an add-on to their WebGlaze[1], which is an invite-only tool meant for artists to protect their art from AI mimicry.
If these companies are adding extra code to bypass artists' attempts to protect their intellectual property from mimicry, then that is an obvious and egregious copyright violation.
More likely it will push these companies to actually pay content creators for the content they produce to be included in their models.
[0] https://nightshade.cs.uchicago.edu/whatis.html
[1] https://glaze.cs.uchicago.edu/webglaze.html
Their poisoning seems like something that shouldn't be hard to detect and filter on. There is enough perturbation to create visual artifacts people can see. Steganography research is much further along in being undetectable. I would imagine that to disrupt training sufficiently, you couldn't keep the perturbations small enough to go undetected.
They will learn to pay for high quality data instead of blindly relying on internet contents.
I'm not sure if things are as fine as you say they are. Certificate authorities were practically unheard of outside of corporate websites (and even then mostly restricted to login pages) until Let's Encrypt normalized HTTPS. Without the openness of Let's Encrypt, we'd still be sharing our browser history and search queries with our ISPs for data mining. Attestation providers have so far refused to revoke attestation for known-vulnerable devices (because customers needing to replace thousands of devices would be an unacceptable business decision), making the entire market rather useless.
That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But: it's a bad solution to an even worse problem.
Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.
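The proof-of-work approach behind tools like Anubis can be sketched in a few lines: the server hands out a challenge, and the client must burn CPU finding a nonce whose hash clears a difficulty bar before it gets the page. Cheap for one human visit, expensive at scraper scale. The parameters below are illustrative, not Anubis's actual protocol:

```python
# Minimal proof-of-work sketch (Hashcash-style), illustrating the idea
# behind PoW firewalls like Anubis. Difficulty and encoding are
# illustrative assumptions, not the real protocol.
import hashlib
import itertools

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) has
    difficulty_bits leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Server-side check: one hash, regardless of how hard solving was."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = solve(b"session-token-123", 16)  # ~65k hashes on average
print(verify(b"session-token-123", nonce, 16))  # True
```

The asymmetry is the point: verification costs the server one hash, while solving costs the client thousands, which mostly deters cheap bulk crawling rather than determined, well-funded scrapers.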
People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their networks, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middleman to most of the internet.