Everyone loves the dream of a free for all and open web.
But the reality is: how can someone small protect their blog or content from AI training bots? What, they just blindly trust that whoever is hitting them is sending an agent rather than a training bot and super duper respecting robots.txt? Get real...
Or, fine, what if they do respect robots.txt, but then they buy data that may or may not have been shielded through liability layers as "licensed data"?
Unless you're Reddit, X, Google, or Meta, with scary unlimited-budget legal teams, you have no power.
Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the standpoint of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them might in theory be well-behaved, the payoff from trying to account for that is too small to bother. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
A rhino can't not be huge and destructive and humans can't not be shitty and selfish. Badly behaved scrapers are simply an inevitable fact of the universe and there's no point trying to do anything because it's an immutable law of reality and can never be changed, so don't bother to try
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably contact the ISP. Don't complain about the scraping; complain about the DDoS (which is the actual problem and, I'm increasingly beginning to believe, the intent).
Great! How do I get, say, Google's ISP to disconnect them?
Every ISP has an abuse email contact you can look up.
Sure, let me just contact that one ISP located in Russia or India, I am sure they will care a lot about my self-hosted blog
Hence the need for Cloudflare?
I am not comfortable with a private company being the only solution, especially when they have a history of deplatforming sites.
Except that's exactly what you should do. And if they refuse to cooperate you contact the network operators between them and yourself.
Imagine if Chinese or Russian criminal gangs started sending mail bombs to the US/EU and our solution would be to require all senders, including domestic ones, to prove their identity in order to have their parcels delivered. Completely absurd, but somehow with the Internet everyone jumps to that instead of more reasonable solutions.
The internet is not a mirror of the real world.
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers, it's about letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I would not want to say that I ask Copilot how to use Windows Narrator, or ask Junie where the CSS that makes this text bold lives, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
They were okay with it when Google was sending them traffic. Now it often doesn't. Google has broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
There's something more obviously nefarious and existential about AI. It takes the idea of "you are the product" to a whole new level.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it later, for reasons that are happening now but that they won't become fully aware of until later.
Sure. But we're already talking about a presumption of free and open here. I'm sure people are also reading my words and incorporating them into their own for-profit work. If I cared, I wouldn't make it free and open in the first place.
But that is not something you can protect against with technical means. At best you can block the little fish and give even more power to the mega corporations, who will always have a way to get to the data: either by operating crawlers you cannot afford to block, by incentivizing users to run their browsers and/or extensions that collect the data, or by buying the data from someone who does.
All you end up doing is participating in the enshittification of the web for the rest of us.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies which use every possible trick to bypass every block, restriction, or rate limit you put in front of them are a completely new problem, on a scale we've never seen before.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (i.e., the freedom) of another legal entity begin. When I don't want "AI" (not just the badly behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way as it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you are getting access to truly free websites whose owners are interested in having smart chatbots trained on the free web, but you are blocking said chatbots while touting a "free Internet" message.
AI is one of those bad actors
But we should also not throw out the baby with the bathwater. All these attempts at blocking AI bots also block other kinds of crawlers as well as real users with niche browsers.
Meanwhile if you are concerned with the parasitic nature of AI companies then no technical measure will solve that. As you have already noted, they can just buy your data from someone else who you can't afford to block - Google, users with a browser extension that records everything, bots that are ahead of you in the game of cat and mouse, etc.
> "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?"
What this scenario actually reveals is that the words "open to the public" are not intended to mean "access is completely unrestricted".
It's fine to not want to give completely unrestricted access to something. What's not fine, or at least what complicates things unnecessarily, is using words like "open and free" to describe this desired actually-we-do-want-to-impose-certain-unstated-restrictions contract.
I think people use words like "open and free" to describe the actually-restricted contracts they want to have because they're often among like-minded people for whom these unstated additional restrictions are tacitly understood -- or, simply because it sounds good. But for precise communication with a diverse audience, using this kind of language is at best confusing, at worst disingenuous.
Nobody has ever meant "access is completely unrestricted".
As a trivial example: what website is going to welcome DDoS attacks or hacking attempts with open arms? Is a website no longer "open to the public" if it has DDoS protection or a WAF? What if the DDoS makes the website unavailable to the vast majority of users: does blocking the DDoS make it more or less open?
Similarly, if a concert is "open to the public", does that mean they'll be totally fine with you bringing a megaphone and yelling through the performance? Will they be okay with you setting the stage on fire? Will they just stand there and say "aw shucks" if you start blocking other people from entering?
You can try to rules-lawyer your way around commonly-understood definitions, but deliberately and obtusely misinterpreting such phrasing isn't going to lead to any kind of productive discussion.
>You can try to rules-lawyer your way around commonly-understood definitions
Despite your assertions to the contrary, "actually free to use for any purpose" is a commonly understood interpretation of "free to use for any purpose" -- see permissive software licenses, where licensors famously don't get to say "But I didn't mean big companies get to use it for free too!"
The onus is on the person using a term like "free" or "open" to clarify the restrictions they actually intend, if any. Putting the onus anywhere else immediately opens the way for misunderstandings, accidental or otherwise.
To make your concert analogy actually fit: A scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert. They do only the things an ordinary member of the public does; they can't do anything else. The most "damage" they can do is to keep humans who would enjoy the concert from being able to attend if there aren't enough seats; whatever additional costs they cause (air conditioning, let's say) are the same as the costs that would have been incurred by that many humans.
> To make your concert analogy actually fit: A scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert.
The scraper is sending ten million robots to your concert. They're packing out every area of space, they're up on the stage, they're in all the vestibules and toilets even though they don't need to go. They've completely crowded out all the humans, who were the ones who actually need to see the concert.
You'd have been fine with a few robots. It used to be the case that companies would send one robot each, and even though they were videotaping, they were discreet about it and didn't get in the humans' way.
Now some imbecile is sending millions of robots, instead of just one with a video camera. All the robots wear the scraper's company uniform at first, so to deal with this problem you tell all robots wearing it to go home. Then they all come back dressed identically to the humans in the queue, jumping ahead of them, deliberately disguising who they are because they know you'll kick them out. They're not taking no for an answer, and they're going to use their sheer mass and numbers to crowd out your concert. Nobody seems to know why they do it, and nobody knows for sure who is sending the robots, because the robot owners all deny they're theirs. But somebody is sending them.
Using "open and free" to mean "I actually want no restrictions at all" is also confusing and disingenuous, because, as you yourself point out, a lot of people don't mean that by those words.
The other thing, though, is that there's a difference between "I personally want to release my personal work under open, free, and unrestricted terms" and "I want to release my work into a system that allows people to access information in general under open, free, and unrestricted terms". You can't just look at the individual and say "Oh, well, the conditions you want to put on your content mean it's not open and free so you must not actually want openness and freedom". You have to look at the reality of the entire system. When bots are overloading sites, when information is gated behind paywalls, when junk is firehosed out to everyone on behalf of paid advertisers while actual websites are down on page 20 of the search results, the overall situation is not one of open and free information exchange, and it's naive to think that individuals simply dumping their content "openly and freely" into this environment is going to result in an open and free situation.
Asking people to just unilaterally disarm by imposing no restrictions, while other less noble actors continue to impose all sorts of restrictions, will not produce a result that is free of restrictions. In fact quite the opposite. In order to actually get a free and open world in the large, it's not sufficient for good actors to behave in a free and open manner. Bad actors also must be actively prevented from behaving in an unfree and closed manner. Until they are, one-sided "gifts" of free and open content by the good actors will just feed the misdeeds of the bad actors.
> Asking people to just unilaterally disarm by imposing no restrictions
I'm not asking for this. I'm asking for people who want such restrictions (most of which I consider entirely reasonable) to say so explicitly. It would be enough to replace words like "free" or "open" with "fair use", which immediately signals that some restrictions are intended, without getting bogged down in details.
Why? It seems you already know what people mean by "open and free", and it does have a connection to the ideals of openness and freedom, namely in the systemic context that I described above. So why bother about the terminology?
What people mean by words like "open" and "free" varies. It varies a lot, and a lot turns on what they actually mean.
The only sensible way forward is to be explicit.
Why fight this obvious truth? Why does it hurt so much to say what you mean?
You can always stop bots: add a login/password. But people want their content to be accessible to as large an audience as possible, while at the same time not wanting that data to be accessible to the same audience via other channels. Logic. Bots are not consuming your data - humans are. At the end of the day humans will eventually read it and take action. For example, ChatGPT will mention your site and the user will visit it.
And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things change anyway. They have an additional safeguard that when they notice a slowdown in your responses, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent an email to them about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'll add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your public documentation?)
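For reference, the relevant part of my robots.txt looks roughly like this (the delay value here is illustrative):

```
# Block Anthropic's crawler entirely
User-agent: ClaudeBot
Disallow: /

# Everyone else: crawl, but slowly
User-agent: *
Crawl-delay: 10
```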
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
And still, Claudebot is one of the better behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP-addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). Turns out there are parties that crawl using millions of (residential) IPs and identify themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that wasn't catching too many real users. I used that to start firewalling IPs on sight and then their own randomization caused every IP to fall into the trap in the end. But it took days.
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
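The watcher itself was nothing special; simplified a lot, it was roughly the sketch below. The real tell-tale is deliberately not shown (assume a hypothetical log marker), and it assumes an nftables set named `scrapers` that the firewall already drops:

```python
import re
import subprocess
import time

# Hypothetical tell-tale: the bad requests shared a property that shows up at
# the end of each access-log line. The real pattern is deliberately not shown.
MARKER = re.compile(r'^(\d{1,3}(?:\.\d{1,3}){3}) .* "suspicious-marker"$')

def follow(path):
    """Yield lines appended to the log file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line.rstrip("\n")

banned = set()
for line in follow("/var/log/nginx/access.log"):
    m = MARKER.match(line)
    if m and m.group(1) not in banned:
        ip = m.group(1)
        banned.add(ip)
        # Add the offender to an nftables set that the firewall drops on sight.
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "scrapers", f"{{ {ip} }}"],
            check=False,
        )
```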
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
I am still a bit confused by what some of these crawlers are getting out of it; repeatedly crawling sites that haven't changed seems to be the norm for the current crawlernets, which seems like a massive waste of resources on their end for what is, on average, data of rather indifferent quality.
Nothing. They're not designed to be useful. They're designed to grab as much data as possible and they'll figure out what to do with it later - they don't know it's mostly useless yet.
Tarpits are cool.
Did you send any abuse reports to the ASNs for those IP addresses?
They're basically describing the tragedy of the commons, but where a handful of the people have bulldozers to rip up all the grass and trees.
We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.
Bingo. Thanks for clarifying exactly my point
That's a very "BSD is freedom and GPL isn't" kind of philosophy.
Nothing is truly free unless you give equal respect to fellow hobbyists and megacorps using your labor for their profit.
GPL doesn't care if you use it for profit or not (good), it just says that the resultant model needs to be open too. And open models exist in droves nowadays. Even closed models can be distilled into open ones.
>You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
It's perfectly legit to want to have a "free and open for all except big corporations and AI engines".
I think that was the point. Everyone loves the dream, but the reality is different.
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
> If you don't want AI bots reading information on the web, you don't actually want a free and open web.
This is such a bad faith argument.
We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!
If an AI bot is accessing my site the way that regular users are accessing my site -- in other words everyone is using the town center as intended -- what is the problem?
Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.
So if I buy a DDoS service and DDoS your site, it's ok as long as it accesses it the same way regular people do? Sorry for the extreme example; it's obviously not ok, but that's how I understand your position as written.
We can also consider 10 exploit attempts per second that my site sees.
The issue is that people seem to be conflating badly built scraper bots with AI. If an AI accessed my site as frequently as a normal human (or say Googlebot) then that particular complaint merely goes away. It never had anything to do with AI itself.
Unironically, if we want everyone to enjoy the town center, we should let people do drugs.
Set aside that there's a pretty big difference between AI scraping and illegal drug usage.
If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?
I think this is actually a good example despite how stark the differences are: both the nuisance AI scrapers and the drug addicts have negative externalities that they could in principle regulate themselves, but for whatever reason are proving unable to, and therefore they cause other people to have a bad time.
Other commenters are voicing the usual “drugs are freedom” type opinions, but having now lived in China and Japan, where drugs are dealt with very strictly (and which basically don’t have a drug problem today), I can see the other side of the argument: places feeling dirty and dangerous because of drugs - even if you think of addicts sympathetically, as victims who need help - makes everyone else less free to live the lifestyle they would like to have.
More freedom for one group (whether to ruin their own lives for a high; or to train their AI models) can mean less freedom for others (whether to not feel safe walking in public streets; or to publish their little blog in the public internet).
> just don't let it negatively impact anyone around you.
Exactly! Which is why we don't want AI bots siphoning our bandwidth & processing power.
Clearly you don't want the whole community to enjoy it then. Openness is incompatible with keeping the riff raff out
It isn't incompatible at all. You might also be shocked to learn that all you can eat buffets will kick you out if you grab all the food and dump it on your table.
> information is free and available for anyone.
Bots aren't people.
You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.
You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.
> Bots aren't people.
I am though and I get blocked by these bot checks all the time.
Buddha, what makes us human?
That's simple: running up-to-date Chrome with JavaScript enabled does.
I want to be able to enjoy water fountains and libraries without having to show my ID. Somehow we are able to police those via other means, so let's not shit up the web with draconian measures either.
Does allowing bots to access my information prevent other people from accessing my information? No. If it did, you'd have a point and I would be against that. So many strange arguments are being made in this thread.
Ultimately it is the users of AI (and I am one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.
> Does allowing bots to access my information prevent other people from accessing my information? No.
Yes it does, that's the entire point.
The flood of AI bots is so bad that (mainly older) servers are literally being overloaded and (newer servers) have their hosting costs spike so high that it's unaffordable to keep the website alive.
I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH. Downloading the same jpegs every 2-3 minutes into perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.
Even with my very cheap transfer, it racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.
I'm still very confused by who is actually benefitting from the bots; from the way they behave it seems like they're wasting enormous amounts of resources on both ends for something that could have been done massively more efficiently.
That's a problem with scrapers, not with AI. I'm not sure why there are way more AI scraper bots now than there were search scraper bots back when that was the new thing. However, that's still an issue of scrapers and rate limiting, and has nothing to do with wanting or not wanting AI to read your free and open content.
This whole discussion is about limiting bots and other unwanted agents, not about AI specifically (AI was just an obvious example)
Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?
I don’t feel a particular need to subsidize multi-billion, even trillion, dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don’t know how to use conditional GETs or caching, let alone parse and respect robots.txt.
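For the record, polite re-crawling is cheap. A sketch of the conditional GET these bots skip, using Python's requests library (the URL is hypothetical):

```python
import requests

url = "https://example.org/post.html"

# First fetch: remember the validators the server hands back.
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Re-crawl: send the validators back. An unchanged page costs a tiny
# 304 Not Modified response instead of the full body.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=headers)
if second.status_code == 304:
    print("unchanged, reuse the cached copy")
else:
    print("changed, re-download:", len(second.content), "bytes")
```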
Is the problem that they exist, or that they are badly accessing your site? Because there are two issues being conflated here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.
Problem one is they do not honor the conventions of the web and abuse the sites. Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.
Problem one is not specific to AI and not even about AI.
Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.
Problem one _is_ about AI.
It was a similar problem with cryptocurrencies. Out comes some kind of tech thingy, and a million get-rich-quick scammers pop out of the woodwork and start scamming left, right and center. Suddenly everyone's in on the hustle, everyone's cryptomining, or taking over computers and using them for cryptomining, they're setting the world on fire with electricity consumption through the roof just to fight against other people (who they wouldn't need to fight against if they'd just cooperate).
A vision. A gold rush. A massive increase in shitty human behaviour motivated by greed.
And now here we are again with AI. Massive interest. Trillions of dollars being sloshed around, everyone hustling to develop something so they'll get picked and flooded with cash. An enormous pile of deeply unethical and disrespectful behaviour by people who are doing what they're doing because that's where the money is. The AI bubble.
At present, problem one is almost entirely AI companies.
There's actually not much evidence of this, since the attack traffic is anonymous.
HN people working in these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.
I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.
Did HN people present evidence?
And a few decades ago, it would have been search engine scrapers instead.
And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
The word freedom, while it implies no boundaries, is always bound by ethics, mutual respect, and the "do no harm" principle. The moment you trip any one of these wires and break them, the mechanisms to counter that become active.
Then we cry "but, freedom?!" Freedom also includes the consequences of one's actions.
Freedom without consequences is tyranny of the powerful.
The problem isn't "AI bot scraping while disregarding all licenses and ethical considerations". The problem is "AI bot scraping while ignoring every good practice to reduce bandwidth usage".
If you ask me "every good practice to reduce bandwidth usage" falls under ethics pretty squarely, too.
While this is certainly a problem, it's not the only problem.
> The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU with code taken from my website -- am I supposed to do something about that?
Creative Commons, GFDL, Unlicense, GPL/AGPL, MIT, WTFPL. Go crazy. I have the freedom to police how users use the information on my site. Yes.
Real examples: my blog is BY-NC-SA and my digital garden is GFDL. You can't take them, mangle them, and sell them. Especially the blog.
AI companies take these posts and sell derivatives, without any references, consent, or compensation. BY-NC-SA is the complete opposite of what they do.
This is why I'm not uploading any photos I take publicly anymore.
Absolutely. If you want to put all kinds of copyright, license, and even payment restrictions on your content go ahead. And if AI companies or people abuse that, that's bad on them.
But I do think: if you're serious about free and open information, then why are you doing that in the first place? It's perfectly reasonable to be restrictive; I write both very open software and very closed software. But I see a lot of people who want to straddle the line when it comes to AI without a rational argument.
Let me try to make my point as compact as possible. I may fail, but please bear with me.
I prefer Free Software to Open Source software. My license of choice is A/GPLv3+. Because, I don't want my work to be used by people/entities in a single sided way. The software I put out is the software I develop for myself, with the hope of being useful for somebody else. My digital garden is the same. My blog is a personal diary in the open. These are built on my free time, for myself, and shared.
See, permissive licenses are for "developer freedom": you can do whatever you want with what you grab, as long as you write a line in the credits. The A/GPL family is different. It wants reciprocity. It empowers the user over the developer. You have to give out the source. Whoever modifies the source shares the modifications. It stays in the open. It has to stay open.
I demand this reciprocity for what I put out there. The licenses reflect that. It's "restricting the use to keep the information/code open". I share something I spent my time on, and I want it to live in the open; I want a little respect for putting out what I did. That respect is not fame or superiority. Just don't take it and run with it, keeping all the improvements to yourself.
It's not yours, but ours. You can't keep it to yourself.
When it comes to AI, it's an extension of this thinking. I do not give consent to a faceless corporation to close off, twist, and earn money from what I put out for the public good. I don't want a set of corporations acting as middlemen that take what I put out, repackage and corrupt it in the process, and sell it. It's not about money; it's about ethics, doing the right thing, and being respectful. It's about exploitation. The same applies to my photos.
I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies. I get equally angry when a company's source-available code is scraped and used for suggestions as well as an academic's LGPL high-performance matrix library which was developed via grants over the years. These things affect people's livelihoods.
I get angry when people say "if we take permission for what we do, AI industry will collapse", or "this thing just learns like humans, this is fair use".
I don't buy their "we're doing something awesome, we need no permission" attitude. No, you need permission to use my content. Because I say so. Read the fine print.
I don't want knowledge to be monopolized by these corporations. I don't want the small fish to be eaten by the bigger one and what remains is buried into the depths of information ocean.
This is why I stopped sharing my photos for now, and my latest research won't be open source for quite some time.
What I put out is for humans' direct consumption. Middlemen are not welcome.
If you have any questions, or if I left any holes up there, please let me know.
I respect the desire for reciprocity, but strong copyleft isn't the only, or even the best, way to protect user freedom or public knowledge. My opinion is that permissive licensing and open access to learn from public materials have created enormous value precisely because they don't pre-empt future uses. Requiring permission for every new kind of reuse (including ML training) shrinks the commons, entrenches incumbents who already have data deals, and reduces the impact of your work. The answer to exploitation is transparency, attribution, and guardrails against republication, not copyright enforced restrictions.
I used to be much more into the GPL than I am now. Perhaps it was much more necessary decades ago or perhaps our fears were misguided. I license all my own stuff as Apache. If companies want to use it, great. It doesn't diminish what I've done. But those who prefer GPL, I completely understand.
> as well as an academic's LGPL high performance matrix library which is developed via grants over the years.
The academic got paid with grants. So now this high performance library exists in the world, paid for by taxes, but it can't be used everywhere. Why is it bad to share this with everyone for any purpose?
> What I put out is for humans' direct consumption. Middlemen are not welcome.
Why? Why must it be direct consumption? I've used AI tools to accomplish things that I wouldn't be able to do on my own in my free time -- work that is now open source. Tons of developers this week are benefiting from what I was able to accomplish using a middleman. Not all middlemen, by definition, are bad. Middlemen can provide value. Why is that value not welcome?
> I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies.
If you define AI/LLM/Generative technology/etc. as the exploitation of people, artists, musicians, software developers, and other companies, then you are against it. As software developers, our work directly affects the livelihoods of people. Everything we create is meant to automate some human task. To be a software developer and then complain that AI is going to take away jobs is to be a hypocrite.
Your whole argument is easily addressed by requiring the AI models to be open source. That way, they obviously respect the AGPL and any other open license, and contribute to the information being kept free. Letting these companies knowingly and obviously infringe licenses and all copyright as they do today is obviously immoral, and illegal.
AGPL doesn't pre-empt future uses or require permission for any kind of re-use. You just have to share alike. It's pretty simple.
AGPL lets you take a bunch of data and AI-train on it. You just have to release the data and source code to anyone who uses the model. Pretty simple. You don't have to rent them a bunch of GPUs.
Actually it can be annoying because of the specific mechanism by which you have to share alike - the program has to have a link to its own source code - you can't just offer the source alongside the binary. But it's doable.
How is it available for everyone if the AI bots bring down your server?
Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top Hacker News post could take my server down.
Yes, because a top hacker news post takes your server down because a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.
The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.
But then we get to use those AI tools.
The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?
No, you don't "get to" use the AI tools. You have to buy access to them (beyond some free trials).
Yes. I get to buy access to them. They're providing an expensive to provide service that requires specialized expertise. I don't see the problem with that.
"If you discuss this thread with your friends are you going to give me credit?"
Yes. How else would I enable my friends to look it up for themselves?
6 months from now when you've internalized this entire thread are you even going to remember where you got it from?
Why are you shifting the discussion by adding two new variables (time/memory)?
Because that's how one interacts with AI.
Yeah. Running out of arguments, are you?
Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.
Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!
The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
Agreed, copyright issues need to be solved via legislation and network abuse issues need to be solved by network operators. Trying to run around either only makes the web worse for everyone.
Rate-limits? Use a CDN? Lots of traffic can be a problem whether it's bots or humans.
You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?
"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).
Everyone can get it from the bots now?
Build better
Nothing is „free“. AI bots eat up my blog like crazy and I have to pay for its hosting.
Don't you have rate-limits? And how much are you paying for the instance where you're hosting it? I've run/helped run projects with something like ~10 req/s easily on $10 VPSs, surely hosting HTML can't cost you that much?
Of course it won't be free, but you can get pretty close to free by employing the typical things you'd put in place to restrict the amount of resources used, like rate limits, caches and so on.
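A per-IP token bucket goes a long way for a small site. A minimal sketch (the numbers are made up; plug it in front of whatever serves the HTML):

```python
import time
from collections import defaultdict

RATE = 2.0    # sustained requests per second allowed per IP
BURST = 20.0  # bucket size, i.e. how big a burst we tolerate

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(ip: str) -> bool:
    """Return True if this request from `ip` is within the rate limit."""
    b = buckets[ip]
    now = time.monotonic()
    # Refill tokens based on elapsed time, capped at the bucket size.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False  # caller responds with 429 Too Many Requests
```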
And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?
Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
> And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Where are people getting this from? No, Cloudflare or any other CDN is not required for you to host your own stuff. Sure, it's easy, and probably the best way to go if you just wanna focus on shipping, but lets not pretend it's a requirement today.
> Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
I don't think they are, that's why we have rate limiters, right? :) I think the point is that if you're allowing a user to access some content in one way, why not allow that same user to access the content in the same way, but using a different user-agent? That's the original purpose of that header after all, to signal what the user used as an agent on their behalf. Commonly, I use Firefox as my agent for browsing, but I should be free to use any user-agent, if we want the web to remain open and free.
My point is that people choosing to outsource the complexity of running a rate limiter and blocking bad actors to Cloudflare and others like them is not the issue you make it out to be.
Why is it good for me to do it myself but bad to pay Cloudflare $20 a month to do it for me? No one is forcing me to use their services. I still have the option to do it myself, or use someone else, or not use anything at all. Seems pretty free to me.
Many AI scraping bots are notoriously bad actors and are hammering sites. Please don’t pretend they are all or even mostly well behaved. We didn’t have this push with the search engine scraping bots as those were mostly well behaved.
You are setting up a straw man with "hey, why not let this hypothetical well-behaved bot in". That isn't the argument or the reality. We didn't have the need to block Google's, Yahoo's, or Bing's bots because they respected robots.txt and had a reasonable frequency of visits.
The dream is real, man. If you want open content on the Internet, it's never been a better time. My blog is open to all - machine or man. And it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.
I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.
Amazonbot seems to love visiting my site, and it is always welcome.
> I don't see why anyone would bother trying to distinguish humans from AI.
Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.
If AI bots were well-behaved, maintained a consistent user agent, used consistent IP subnets, and respected robots.txt, I wouldn't have a problem with them. You could manage your content filtering however you want (or not at all) and that would be that. Unfortunately at the moment, AI bots do everything they can to bypass any restrictions or blocks or rate limits you put on them; they behave as though they're completely entitled to overload your servers in their quest to train their AI bots so they can make billions of dollars on the new AI craze while giving nothing back to the people whose content they're misappropriating.
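If they did behave like that, verifying them would also be cheap. The standard reverse-then-forward DNS check that honest crawlers like Googlebot and Bingbot support is only a few lines; a sketch (the suffix list is illustrative):

```python
import socket

# Illustrative suffixes; each honest crawler documents its own reverse-DNS domains.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip: str) -> bool:
    """Reverse-DNS the IP, check the hostname, then forward-resolve to confirm."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # e.g. crawl-66-249-66-1.googlebot.com
        if not host.endswith(ALLOWED_SUFFIXES):
            return False
        addresses = socket.gethostbyname_ex(host)[2]  # forward lookup must include the IP
        return ip in addresses
    except (socket.herror, socket.gaierror):
        return False
```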
I've not seen an AI scraper reading a blog post 100,000 times in an hour to see if it's changed. As far as I can tell, that's a NI hallucination. Typical fetch rates are more like 3 times per second (10k per hour) and fetch a different URL each time.
>Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.
You have zero evidence of this actually happening (because it's not happening).
The only bot that bugs the crap out of me is Anthropic's one. They're the reason I set up a labyrinth using iocaine (https://iocaine.madhouse-project.org/). Their bot was absurdly aggressive, particularly with retries.
It's probably trivial in the whole scheme of things, but I love that anthropic spent months making about 10rps against my stupid blog, getting markov chain responses generated from the text of Moby Dick. (looks like they haven't crawled my site for about a fortnight now)
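The generator side of such a labyrinth is tiny; a rough sketch of the kind of Markov-chain babbler iocaine-style tarpits serve (the corpus file is hypothetical):

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each `order`-word prefix to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=200, order=2):
    """Generate plausible-looking nonsense for a crawler to chew on."""
    out = list(random.choice(list(chain)))
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:                      # dead end: jump to a new prefix
            out.extend(random.choice(list(chain)))
            continue
        out.append(random.choice(followers))
    return " ".join(out)

corpus = open("moby_dick.txt", encoding="utf-8").read()  # hypothetical local copy
print(babble(build_chain(corpus)))
```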
No wonder Anthropic isn't working well! The "Moby Dicked" explanation of the state of AI!
But seriously, Why must someone search even a significant part of the public Internet to develop an AI? Is it believed that missing some text will cripple the AI?
Isn't there some sort of "law of diminishing returns" where, once some percentage of coverage is reached, further scraping is not cost-effective?
On the contrary, AI training techniques require gigantic amounts of data to do anything, and there is no upper limit whatsoever - the more relevant data you have to train on, the better your model will be, period.
In fact, the biggest thing that is making it unlikely that LLM scaling will continue is that the current LLMs have already been trained on virtually every piece of human text we have access to today. So, without new training data (in large amounts), the only way they'll scale more is by new discoveries on how to train more efficiently - but there is no way to put a predictable timeline on that.
Ironically, scaling limits and evidence that quality vastly outweighs quantity suggest that all that web data is much less useful than buying and scanning a book. Most work with the Common Crawl data, for example, has ended up focusing on filtering out vast amounts of it as mostly useless for training purposes.
There was a hot minute in 2023 where it looked like we could just scale data and compute to the moon. Shockingly, it turns out there are limits to that approach.
It's traditional to include a link when claiming to be invulnerable. :)
Haha, sounds a bit self-promotional to do that but link in profile.
Not claiming that the site is technologically invulnerable. Just that it's not a big deal if LLMs scrape it (which bizarrely they do).
> Everyone loves the dream of a free for all and open web.
> protect their blog or content from AI training bots
It strikes me that one needs to choose one of these as their visionary future.
Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.
So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.
It's the new "ban cassette tapes to prevent people from listening to unauthorized music," but wrapped in an anti-corporate skin delivered by a massive, powerful corporation that could sell themselves to Microsoft tomorrow.
The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.
One thing I wish people would stop doing is conflating their precious ideas and their bandwidth. The bandwidth is one very serious issue, because it's a denial of service attack. But it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care if the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded for it, "normal" (upper-middle class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.
Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?
By developing Free Software that combats this hostile software.
Corporations develop hostile AI agents;
Capable hackers develop anti-AI agents.
This defeatist attitude of "we have no power".
Yes, I obviously agree with you. I think you've missed my comment's point a little: CF is making these tools and giving millions of people access to them.
Well there's open source stuff like https://github.com/TecharoHQ/anubis; one doesn't need a top-down mandated solution coming from a corporation.
In general Cloudflare has been pushing DRMization of the web for quite some time, and while I understand why they want to do it, I wish they didn't always show off as taking the moral high ground.
Anubis doesn’t necessarily stop the most well funded actors.
If anything we’ve seen the rise in complaints about it just annoying average users.
The actual thing Anubis was created in response to is seemingly a strange kind of DDoS attack that has been misattributed to LLMs: an attacker that makes partial GET requests which are aborted soon after sending the request headers, mostly coming from residential proxies. (Yes, it doesn't help that the author of Anubis also isn't fully aware of the mechanics of the attack. In fact, there is no proper write-up of the mechanism of the attack, which I hope to do someday.)
Having said that, the solution is effective enough, having a lightweight proxy component that issues proof of work tokens to such bogus requests works well enough, as various users on HN seem to point out.
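The proof-of-work piece is essentially hashcash; this is only a sketch of the idea, not Anubis's actual implementation (the difficulty value is illustrative):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # illustrative; roughly a million hashes on average to solve

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """Client side: burn CPU until the hash clears the difficulty target."""
    target = 1 << (256 - DIFFICULTY_BITS)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash to check what took the client many attempts."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

challenge = issue_challenge()
nonce = solve(challenge)         # the expensive part, done by the requester
assert verify(challenge, nonce)  # cheap for the server
```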
> a strange kind of DDoS attack that has been misattributed to LLMs: an attacker that makes partial GET requests which are aborted soon after sending the request headers, mostly coming from residential proxies.
Um, no? Where did you get this strange bit of info?
The original reports say nothing of that sort: https://news.ycombinator.com/item?id=42790252 ; and even original motivation for Anubis was Amazon AI crawler https://news.ycombinator.com/item?id=42750420
(I've seen more posts with the analysis, including one which showed an AI crawler that would identify itself properly, but once it hit the rate limit would switch to a fake user agent from proxies... but I cannot find it now)
How does an agent help my website not get crushed by traffic load, and how is this proposal any different from the gatekeeping problem to the open web, except even less transparent and accountable because now access is gated by logic inside an impenetrable web of NN weights?
This seems like slogan-based planning with no actual thought put into it.
Whatever is working against the AI doesn’t have to be an AI agent.
So proof of work checks everywhere?
Sure, as long as it doesn't discriminate against user agents.
So basically cloudflare but self-hosted (with all the pain that comes from that)?
What’s so painful about self hosting? I’ve been self hosting since before I hit puberty. If 12 year old me can run a httpd, anyone can.
And if you don’t want to self host, at least try to use services from organisations that aren’t hostile to the open web
I self-host lots of stuff. But yes, it is more of a pain to host a WAF that can handle billions of requests per minute. It's even harder to do it for free like Cloudflare. And in the end, the result for the user is exactly the same whether you use a self-hosted WAF or let someone else host it for you.
But you don't get billions of requests per minute. You get maybe five requests per second (300 per minute) on a bad day. The sites that seem to be getting badly attacked, they get 200 per second, which is still within reach of a self hosted firewall. Think about how many CPU cycles per packet that allows for. Hardly a real DDoS.
The only reason you even want to firewall 200 requests per second is that the code downstream of the firewall takes more than 5ms to service a request, so you could also consider improving that. And if you're only getting <5 and your server isn't overloaded then why block anything at all?
Such entitlement.
How much additional tax money should I spend at work so the AI scum can make 200 searches per second?
Humans and 'nice' bots make about 5 per second.
If you're handling billions of requests per second, you're not a self hoster. That's a commercial service with a dedicated team to handle traffic around the clock. Most ISPs probably don't even operate lines that big
To put that in perspective, even if they're sending empty TCP packets, "several billion" pps is 200 to 1800 gigabits of traffic, depending on what you mean by that. Add a cookieless HTTP payload and you're at many terabits per second. The average self hoster is more likely to get struck by lightning than encounter and need protection from this (even without considering the, probably modest, consequences of being offline a few hours if it does happen)
Edit: off by a factor of 60, whoops. Thanks to u/Gud for pointing that out. I stand by the conclusion though: less likely to occur than getting struck by lightning (or maybe it's around equally likely now? But somewhere in that ballpark) and the consequences of being down for a few hours are generally not catastrophic anyway. You can always still put big brother in front if this event does happen to you and your ISP can't quickly drop the abusive traffic
If somebody decides they hate you, your site that could handle, say, 100,000 legitimate requests per day could suddenly get billions of illegitimate requests.
They could. Let me know when it happens
I have this argument every time self hosting comes up, and every time I wonder if someone will do it to me to make a point, or whether one of the like million comments I post, or one of the many tools I host, will upset someone. Yet to happen, idk. It's like arguing over whether you need to carry a knife on the street at all times because someone might get angry at a look. It happens, we have a word for it in NL (zinloos geweld, "senseless violence") and commemorative tiles in sidewalks (ladybug depictions) and everything, but no normal person actually carries weapons 24/7 (drug dealers surely, yeah) or talks only through an intermediary.
I'd suspect other self hosters just see more shit than I do, were it not that nobody ever says it happened to them. The only argument I ever hear is that they want to be "safe" while "self hosting with Cloudflare". Who's really hosting your shit then?
I've had my involvement with the computer underground.
A web site owner published something he really shouldn't have and got hacked. I wound up being a "person of interest" in the resulting FBI investigation because I was the weirdest person in the chat room for the site. I think it drove them crazy I was using Tor so they got somebody to try to entrap me into sharing CP but (1) I'm not interested and (2) know better than that.
That's definitely the most interesting response I've had to this question, thanks for that
Will have to give this a second thought, but my first one now that I read this: would Cloudflare have helped against the FBI, or against any foreign nation sending Cloudflare a request about child porn? Surely not?! A different kind of opsec is surely more relevant there, so I don't know if it's really relevant to "normal", legal self hosting (as opposed to criminal, much less that level of unethical+criminal) communities, or if there's an aspect I'm missing here.
Not everybody wants to manage some commercial-grade packet filter that can handle some DDoSing script kiddie; that's a strong argument.
But another argument against using the easiest choice, the near monopoly, is that we need a diverse, thriving ecosystem.
We don’t want to end up in a situation where suddenly Cloudflare gets to dictate what is allowed on the web.
We have already lost email to the tech giants; try running your own mail sometime. The technical aspect is easy; the problem is you will end up in so many spam folders it’s disgusting.
What we need are better decentralized protocols.
Please do try running your own mail some time. It's not nearly as hard as doomers would have you think. And if you only receive, you don't have any problems at all.
At first, you can use it for less serious stuff until you see how well it works.
I do, I host my own mail server.
Technically it's not very challenging. The problem is the total dominance of a few actors and a lot of spammers.
I haven't had spam issues since using a catch-all and giving everyone a unique address, blocking ones that receive spam
Won't work if you need a fixed address on a business card or something, but in case you don't...
Waiting for the day they catch on. Then it's time for a challenge-response protocol I guess
To be fair, he did say per minute :-)
Oh, whoops. Divide everything by 60, quick!
That does make it a bit less ludicrous even if I think the conclusion of my response still applies
That's a mantra, not a solution.
Sometimes it's a hardware problem, not a software problem.
For that matter, sometimes it's a social/political problem and not a technological problem.
This is the attitude I like to see. As they say (and I actually hate this phrase because of its past connotations), "freedom isn't free".
We have thousands of engineers from these companies right here on Hacker News, and they cry and scream about privacy and data governance on every topic but their own work. If you guys need a mirror to do some self-reflection, I am offering to buy one.
In the recent days, the biggest delu-lulz was delivered by that guy who'd bravely decided to boycott Grok out of... environmental concerns, apparently. It's curious how everybody is so anxious these days, about AI among other things in our little corner of the web. I swear, every other day it's some new big fight against something... bad. Surely it couldn't ALL be attributed to policy in the US!
I'll contribute for the mirror. The hypocrisy is so loud, aliens in outer space can hear it (and sound doesn't even travel in vacuum).
What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.
I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).
[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....
To be more specific, if we assume good faith on the part of our fine congresspeople to craft this well... OK yeah, for the hypothetical case I'll continue...
What legal teeth I would advocate would be targeted to crawlers (a subset of bot) and not include your usage. It would mandate that Big Corp crawlers (for search indexing, AI data harvesting, etc.) be registered and identify themselves in their requests. This would allow serverside tools to efficiently reject them. Failure to comply would result in fines large enough to change behavior.
Now that I write that out, if such a thing were to come to pass, and it was well received, I do worry that congress would foam at the mouth to expand it to bots more generally, Microsoft-Uncertified-Devices, etc.
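As a sketch of the serverside rejection such a mandate would enable, assuming a hypothetical registry of crawler tokens (the token names below are made up; nothing like this registry exists today):

```python
# Sketch: cheap serverside rejection of self-identified, registered crawlers.
# The registry and the tokens are hypothetical placeholders.
REGISTERED_CRAWLER_TOKENS = ("ExampleSearchBot", "ExampleAITrainingBot")

def allow_request(user_agent: str) -> bool:
    """Return False for any request whose User-Agent declares a registered crawler."""
    return not any(token in user_agent for token in REGISTERED_CRAWLER_TOKENS)

assert allow_request("Mozilla/5.0 (X11; Linux x86_64)")
assert not allow_request("ExampleAITrainingBot/1.0 (+https://example.com/bot)")
```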
Yeah, my main worry here is how we define the unwanted traffic, and how that definition could be twisted by bigcorp lawyers.
If it's too loose and similar to "wanted traffic is how the authors intend the website to be accessed, unwanted traffic is anything else", that's an argument that can be used against adblocks, or in favor of very specific devices like you mention. Might even give slightly more teeth to currently-unenforceable TOS.
If it's too strict, it's probably easier to find loopholes and technicalities that just lets them say "technically it doesn't match the definition of unwanted traffic".
Even if it's something balanced, I bet bigcorp lawyers will find a way to twist the definitions in their favor and set a precedent that's convenient for them.
I know this is a mini-rant rather than a helpful comment that tries to come up with a solution, it's just that I'm pessimistic because it seems the internet becomes a bit worse day by day no matter what we try to do :c
You don't get sued for using a service as it is meant to be used (using an RSS reader on their feed endpoint; cloning repositories that it is their mission to host). First, it doesn't anger anyone, so they wouldn't bother trying to enforce a rule; second, it's a fruitless case, because the judge would say the claim they're making isn't reasonable.
Robots.txt is meant for crawlers, not user agents such as a feed reader or git client
I agree with you; generally you can expect good faith to be returned with good faith (but here I want to emphasize heavily that I only agree on the judge part iff good faith can be assumed and the judge is informed enough to actually make an informed decision).
But not everyone thinks that's the purpose of robots.txt. Example, quoting Wikipedia[1] (emphasis mine):
> indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
Quoting the linked `web robots` page[2]:
> An Internet bot, web robot, robot, or simply bot, is a software application that runs automated tasks (scripts) on the Internet, usually with the intent to imitate human activity, such as messaging, on a large scale. [...] The most extensive use of bots is for web crawling, [...]
("usually" implying that's not always the case; "most extensive use" implying it's not the only use.)
Also a quick HN search for "automated robots.txt"[3] shows that a few people disagree that it's only for crawlers. It seems to be only a minority, but the search results are obviously biased towards HN users, so it could be different outside HN.
Besides all this, there's also the question of whether web scraping (not crawling) should also be subject to robots.txt or not; where "web scraping" includes any project like "this site has useful info but it's so unusable that I made a script so I can search it from my terminal, and I cache the results locally to avoid unnecessary requests".
The behavior of alternative viewers like Nitter could also be considered web scraping if they don't get their info from an API[4], and I don't know if I'd consider Nitter the bad actor here.
But yeah, like I said I agree with your comment and your interpretation, but it's not the only interpretation of what robots.txt is meant for.
[1]: https://en.wikipedia.org/wiki/Robots.txt
[2]: https://en.wikipedia.org/wiki/Internet_bot
[3]: https://hn.algolia.com/?dateRange=all&query=automated%20robo...
[4]: I don't know how Nitter actually works or where it gets its data from; I just mention it so it's easier to explain what I mean by "alternative viewer".
> This means I'd get sued for using a feed reader on Codeberg
you think codeberg would sue you?
Probably not.
But it's the same thing with random software from a random nobody that has no license, or has a license that's not open-source: If I use those libraries or programs, do I think they would sue me? Probably not.
It wouldn’t stop anyone. The bots you want to block already operate out of places where those laws wouldn’t be enforced.
Then that is a good reason to deny the requests from those IPs
I've run a few hundred small domains for various online stores with an older backend that didn't scale very well for crawlers and at some point we started blocking by continent.
It's getting really, really ugly out there.
If that were the case, then why am I getting buttflare-blocked here in the EU?
What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build. The only purpose I ever used robots disallow rules for was preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages which have serious side effects (e.g. place an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly this.
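As a sketch of what "separate instructions for different kinds of robots" could look like, here's a hypothetical robots.txt written out by a short script; the indexer and agent tokens are invented examples, and of course nothing forces a crawler to obey them:

```python
# Hypothetical per-agent robots.txt, written out by a small script.
# The user-agent tokens below are illustrative; real tokens vary by operator.
ROBOTS_TXT = """\
# Search indexers: index content, but stay out of pages with side effects
# or half-finished drafts.
User-agent: ExampleSearchIndexer
Disallow: /cart/
Disallow: /checkout/
Disallow: /drafts/

# An LLM agent acting for a user may legitimately need the order flow,
# so only the drafts are off-limits.
User-agent: ExampleLLMAgent
Disallow: /drafts/

# Everyone else.
User-agent: *
Disallow: /drafts/
"""

with open("robots.txt", "w", encoding="utf-8") as fh:
    fh.write(ROBOTS_TXT)
```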
> I see zero reasons to oppose robots visiting any website I would build.
> preventing search engines from indexing incomplete versions or going the paths which really make no sense for them to go.
What will you do when the bots ignore your instructions, and send a million requests a day to these URLs from half a million different IP addresses?
Let my site go down, and then restart my server a few hours later. I'm a dude with a blog; I'm not making uptime guarantees. I think you're overestimating the harm and how often this happens.
Misbehaving scrapers have been a problem for years, not just since AI. I've written posts on how to properly handle scraping, the legal grey area it puts you in, and how to be a responsible scraper. If companies don't want to be responsible, the solution isn't to abandon an open web. It's to make better law and enforce it.
Sue them / press charges. DDoS is a felony.
> What we need is to stop fighting robots and start welcoming and helping them. I see zero reasons to oppose robots visiting any website I would build.
Well, I'm glad you speak for the entire Internet.
Pack it in folks, we've solved the problem. Tomorrow, I'll give us the solution to wealth inequality (just stop fighting efforts to redistribute wealth and political power away from billionaires hoarding it), and next week, we'll finally get to resolve the old question of software patents.
The funny thing about the good old WWW is the first two W's stand for world-wide.
So which legal teeth, exactly?
Try hosting some illegal-in-the-US content and find out.
It should have the same protections as an EULA, where the crawler is the end user, and crawlers should be required to read it and apply it.
So none at all? EULAs are mostly just meant to intimidate you so you won't exercise your inalienable rights.
I find that extremely hard to believe. Do you have a source?
I have the feeling that it's the small players that cause problems.
Dumb bots that don't respect robots.txt or nofollow are the ones trying all combinations of the filters available in your search options and requesting all pages for each such combination.
The number of search pages can easily be exponential in the number of filters you offer.
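To make that concrete with a quick calculation (filter counts chosen arbitrarily):

```python
# Each independent on/off filter doubles the number of distinct result pages.
# Multi-valued filters (price ranges, categories, sort orders) grow even faster.
for n_filters in (10, 20, 30):
    pages = 2 ** n_filters
    print(f"{n_filters} boolean filters -> {pages:,} crawlable filter combinations")

# 10 -> 1,024;  20 -> 1,048,576;  30 -> 1,073,741,824
```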
Bots that walk into these traps do it because they are dumb. But even a small, degenerate bot can send more requests than 1M MAUs.
At least that's my impression of the problem we're sometimes facing.
Signed agents seem like a horrific solution. And for many, just serving the traffic is better.
No we don't.
- Moral rules are never really effective
- Legal threats are never really effective
Effective solutions are:
- Technical
- Monetary
I like the idea of the web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information if you're the leecher type, or earn some back by making contributions.
It's more or less the same concept as torrents back in the day.
This should be applied to email too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 every day, that becomes prohibitive.
>This should be applied to email too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that.
This seems flawed.
Poor people living in 3rd world countries that make like $2.00/day wouldn't be able to afford this.
>But if you want to spam 1,000,000 every day, that becomes prohibitive.
Companies and people with $ can easily pay this with no issues. If it costs $10,000 to send 1M emails that inbox but you profit $50k, it's a non-issue.
I care more about the dream of a wide open free web than a small time blogger’s fears of their content being trained on by an AI that might only ever emit text inspired by their content a handful of times in their life.
This isn't even an issue, you have made that problem up. I host a blog and there are some AI bots coming around. Big deal. Most of them do respect a robots.txt. Some don't. Not a big deal as well.
In contrast, trying to change the infrastructure of the net, which previously was quite resistant to censorship, is quite a big deal.
This sounds exactly like a crazy preacher warning about the dangers of rock music. A completely made up threat. And we need the protection of god against these evil AI bots.
Wow, a bot that disrespected a robots.txt. How can the internet survive...
Also, OpenAI already has the data. You want to ensure they will never get competitors by putting up barriers now. It makes no sense...
“But the reality is how can someone small protect their blog or content from AI training bots?”
Why would you need to?
If your inability to assemble basic HTML forces you to adopt enormous, bloated frameworks that require two full cores of a cpu to render your post…
… or if you think your online missives are a step in the road to content creator riches …
… then I suppose I see the problem.
Otherwise there’s no problem.
So by a free and open-for-all web you mean one only for the tech priests competent enough to build the skills and maintain them in light of changes to the spec (hope these people didn't run across XML/XSLT-dependent techniques building their site), or those with a rich enough family that they can casually learn a skill while not worrying about putting food on the table?
There are going to be bad actors taking advantage of people who cannot fight back without regulations and gatekeepers; suggesting otherwise is about as reasonable as the ancap idea of government.
It's not a question of languages or frameworks, but hardware. I cannot finance servers large enough to keep up with AI bots constantly scraping my host, bypassing caching directives, or changing IPs to avoid bans.
I have had to disable at least one service because AI bots kept hitting it and it started impacting other stuff I was running that I am more interested in. Part of it was the CPU load on the database from rendering dozens of 404s per second (each of which still required a database call); part of it was that the thumbnail images were being queried over and over again with seemingly different parameters for no reason.
I'm sure there are AI bots that are good and respect the websites they operate on. Most of them don't seem to, and I don't care enough about the AI bubble to support them.
When AI companies stop people from using them as cheap scrapers, I'll rethink my position. So far, there's no way to distinguish any good AI bot from a bad one.
> Part of it was the CPU load on the database rendering dozens of 404s per second (which still required a database call)
That's one request every 80 ms, which is an eternity in CPU time. How the hell can you not afford to check that something doesn't exist every 80 ms?
> part of it was that the thumbnail images were being queried over and over again with seemingly different parameters for no reason.
Is there a reason you are serving thumbnails for arbitrary parameters?
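For what it's worth, a common mitigation for both symptoms is to cache negative lookups and to only render a fixed set of thumbnail sizes. A minimal sketch with made-up names, not the parent's actual stack:

```python
# Sketch: keep bots from turning every 404 into a database call, and only
# generate thumbnails for an explicit set of sizes. Names are hypothetical;
# adapt to whatever the real application uses.
import time

NEGATIVE_CACHE: dict[str, float] = {}   # slug -> expiry timestamp
NEGATIVE_TTL = 300                      # seconds to remember "this doesn't exist"
ALLOWED_THUMB_SIZES = {(100, 100), (320, 240), (640, 480)}

def article_exists(slug: str, db_lookup) -> bool:
    now = time.monotonic()
    expiry = NEGATIVE_CACHE.get(slug)
    if expiry and expiry > now:
        return False                    # answered from memory, no DB hit
    if db_lookup(slug):                 # the one real database call
        return True
    NEGATIVE_CACHE[slug] = now + NEGATIVE_TTL
    return False

def thumbnail_params_ok(width: int, height: int) -> bool:
    # Arbitrary ?w=...&h=... combinations get a flat 400/404 instead of a render.
    return (width, height) in ALLOWED_THUMB_SIZES
```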
I recently found out my website has been blocked by AI agents, when I had never asked for it. It seems to be opt-out by default, but in an obscure way. Very frustrating. I think some of these companies (one in particular) are risking burning a lot of goodwill, although I think they have been on that path for a while now.
Are you talking about Cloudflare? The default seems indeed to be to block AI crawlers when you set up a new site with them.
You can lock it up with a user account and a payment system. The fact that the site is up on the internet doesn't mean you can or cannot profit from it; it's up to you. What I would like is a way to notify my ISP and say: block this traffic to my site.
> What I would like is a way to notify my ISP and say: block this traffic to my site.
I would love that, and make it automated.
A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe the block expires after some time -- a day or a month or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.
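A toy model of that hypothetical protocol, with invented message fields and hop logic (nothing like this is standardized today):

```python
# Toy model of "tell my router to block this source, and have the block
# propagate upstream until it reaches the offender's network or expires".
from dataclasses import dataclass
import time

@dataclass
class BlockRequest:
    requester_prefix: str   # who is asking (e.g. "203.0.113.7/32")
    offender_prefix: str    # traffic source to drop (e.g. "198.51.100.0/24")
    expires_at: float       # absolute expiry time

class Router:
    def __init__(self, asn: int, upstream: "Router | None" = None):
        self.asn = asn
        self.upstream = upstream
        self.blocks: list[BlockRequest] = []

    def handle(self, req: BlockRequest, offender_asn: int) -> None:
        if req.expires_at <= time.time():
            return
        self.blocks.append(req)          # install the drop rule locally
        if self.upstream and self.asn != offender_asn:
            self.upstream.handle(req, offender_asn)   # push it one hop upstream

# Usage: home router -> ISP -> transit; ASNs and prefixes are documentation values.
transit = Router(asn=64501)
isp = Router(asn=64500, upstream=transit)
home = Router(asn=64496, upstream=isp)
home.handle(BlockRequest("203.0.113.7/32", "198.51.100.0/24",
                         time.time() + 86400), offender_asn=65000)
```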
You might have this the wrong way around.
It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.
We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.
That's the system that needs to change.
I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.
This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.
> Great video: https://www.youtube.com/shorts/M0QyOp7zqcY
Here's an even greater video: https://www.youtube.com/watch?v=mAUpxN-EIgU&t=4m24s
How can someone small protect their IP? It's called copyright law, and it's been around for a long-ass time; it's just that for some reason big tech gets a pass and can steal and control whatever the fuck they want without limit.
Onion sites have bots and scrapers.
They don't use Cloudflare AFAIK.
They normally use a puzzle that the website generates, or a proof-of-work-based captcha. I've found proof of work good enough out of these two, and it also means that the site owner can run it themselves instead of relying on Cloudflare and third parties.
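A minimal hashcash-style proof-of-work check of that kind looks something like this (the difficulty and parameters are arbitrary):

```python
# Minimal proof-of-work sketch: the server hands out a random challenge and a
# difficulty; the client must find a nonce whose hash has that many leading
# zero bits. Verification is a single hash, so it's cheap for the server.
import hashlib, os, itertools

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

challenge = os.urandom(16)                 # issued per visitor, e.g. in a cookie
nonce = solve(challenge, difficulty=20)    # ~1M hashes on average for the client
assert verify(challenge, nonce, difficulty=20)   # one hash for the server
```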
The problem's cause really is about _who_ has to pay for the traffic, and currently that's the hosting end. If you turn that model around, suddenly AI web scrapers have to behave and all the issues that we currently have are kind of solved(?), because there is no incentive to scrape datasets anymore that were put together by others, and there automatically will be a payment incentive instead to buy high-quality datasets.
But I don't want to make human users of my website pay for the traffic just like I also donate to real world charities that I believe in.
> But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...
baking hashcash into http 1.0/1.1/2/3, smtp, imap, pop3, tls and ssh. then this will all be too expensive for spammers and training bots. but the IETF is infiltrated by government and corporate interests..
Spammers will buy ASICs and get a huge advantage over consumer CPUs
You can't trust everyone will be polite or follow "standards".
However, you can incentivize good behavior. Let's say there's a scraping agent; you could make an x402-compatible endpoint and offer them a discount or something.
Kinda like piracy: if you offer a good, simple, cheap service, people will pay for it rather than go through the hassle of pirating.
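A rough sketch of the shape of that incentive, using a plain HTTP 402 response; to be clear, this is a generic stand-in and not the actual x402 wire format, and the header and field names are invented:

```python
# Generic "pay, or identify yourself for a discount" gate using HTTP 402.
# This does NOT implement the real x402 spec; the header name and the
# pricing are placeholders to show the shape of the incentive.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

FULL_PRICE_USD = 0.01
BOT_DISCOUNT_USD = 0.002   # declared, well-behaved scrapers pay less

class PaywallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("X-Payment-Receipt"):       # placeholder header
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"the content\n")
            return
        declared_bot = "bot" in self.headers.get("User-Agent", "").lower()
        price = BOT_DISCOUNT_USD if declared_bot else FULL_PRICE_USD
        self.send_response(402)                          # Payment Required
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"price_usd": price,
                                     "pay_to": "https://example.com/pay"}).encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), PaywallHandler).serve_forever()
```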
> But the reality is how can someone small protect their blog or content from AI training bots?
A paywall.
In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.
This. We need to get rid of the ad-supported free internet economy. If you want your content to be free, you release it and have no issues with AI. If you want to make money off your content, add a paywall.
We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.
> If you want your content to be free, you release it and have no issues with AI. If you want to make money off your content, add a paywall.
What about licenses like CC-BY-NC (Creative Commons - Non Commercial)?
What about them? As we can see scrapers don’t care about copyright at all, so public licenses don’t really matter to them either.
Which is really all that cloudflare is building here that people are mad about. It’s a way to give bots access to paywalled content.
Where everyone needs a cloudflare account to be able to pay*
“Everyone” in this context being bot operators who want to access websites whose owners have decided to use Cloudflare to block unauthenticated bot traffic.
Which is not everyone.
Why do you have to protect it? Have you suffered any actual problem, or are you being overly paranoid? I think only a few people have actually received DDoS-level traffic, and the rest are being paranoid.
You could run https://zadzmo.org/code/nepenthes/ to punish the AI scrapers.
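For anyone curious, a toy tarpit in the same spirit (this illustrates the concept only and is not the linked project's implementation) just serves slow, endless pages of links that lead to more of the same:

```python
# Toy crawler tarpit: every URL under /maze/ returns a slow page of links
# to more /maze/ URLs, so a misbehaving crawler wanders forever.
import random, time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/maze/"):
            self.send_response(404); self.end_headers(); return
        time.sleep(2)                      # waste the crawler's time, not your CPU
        links = "".join(
            f'<a href="/maze/{random.getrandbits(64):x}">more</a> '
            for _ in range(20)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 8082), TarpitHandler).serve_forever()
```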
How about we discuss, design, and implement a system that charges them for their actions? We could put some dark patterns in our sites that impose this cost through some sort of problem-solving challenge, harvesting the energy of their scraping/LLM tools and directing it toward causes that profit our site, in exchange for revealing some content that achieves their scraping mission too. It looks like things like this exist to some degree.
It is a service available to Cloudflare customers and is opt-in. I fail to see how they’re being gatekeepers when site owners have the option not to use it.
Everyone loves a free for all and open web because it works really well.
Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.
Maybe this is a naive question, but why not just cut an IP off temporarily if it sends too many requests or sends them too fast?
They use many IPs, often not identifiable as the same bot.
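The naive per-IP version is easy enough to write; the problem is exactly that the traffic is spread across IPs that each stay under any sane limit. A minimal sliding-window limiter for the honest case might look like:

```python
# Minimal per-IP sliding-window rate limiter. Works fine against a single
# greedy client; useless when requests come from hundreds of thousands of
# residential proxy IPs that each stay under the limit.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_hits = defaultdict(deque)   # ip -> timestamps of recent requests

def allow(ip: str) -> bool:
    now = time.monotonic()
    q = _hits[ip]
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()                      # forget hits outside the window
    if len(q) >= MAX_REQUESTS_PER_WINDOW:
        return False                     # temporarily cut this IP off
    q.append(now)
    return True
```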
"I want an open web!"
"Okay, that means AI companies can train on your content."
"Well, actually, we need some protections..."
"So you want a closed web with access controls?"
"No no no, I support openness! Can't we just have, like, ethical openness? Where everyone respects boundaries but there's no enforcement mechanism? Why are you making this so black and white?"
> “When we started the “free speech movement,” we had a bold new vision. No longer would dissenters’ views be silenced. With the government out of the business of policing the content of speech, robust debate and the marketplace of ideas would lead us toward truth and enlightenment. But it turned out that freedom of the press meant freedom for those who owned one. The wealthy and powerful dominated the channels of speech. The privileged had a megaphone and used free speech protections to immunize their own complacent or even hateful speech. Clearly, the time has come to denounce the naïve idealism of the past and offer a new movement, Speech 2.0, which will pay more attention to the political economy of media and aim at “free-ish” speech — the good stuff without the bad.”
https://openfuture.eu/paradox-of-open-responses/misunderesti...
"I want a free an open society!"
"But criminals are people too."
See how stupid that sounds?
> Everyone loves the dream of a free for all and open web. But the reality is how can someone small protect their blog or content from AI training bots?
I'm old enough to remember when people asked the same questions of Hotbot, Lycos, Altavista, Ask Jeeves, and -- eventually -- Google.
Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.
It's also not clear to me that AI materially changes the equation, since Google has for many years tried to cut out links to the small sites anyway in favor of instant answers.
(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)
What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask? All I want is a copyleft protection that specifically allows humans to access my work to their heart's content, but disallows AI use of it in any form. Is that truly so unreasonable?
> What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask?
Because the “humans” are really “humans using software to access content” and the “bots” are really “software accessing content on behalf of humans”, and the “bots” of current concern are largely software doing so to respond to immediate user requests, instead of just building indexes for future human access.
It's not unreasonable to ask but I think it probably is unreasonable to expect a strictly technical solution. It feels like we're in the realm of politics, policy, and law.
Oh, sure. I absolutely want a legal solution, not a technical one.
Yes, it is an unreasonable and absurd ask. You cannot want freedom while restricting it. You forget that it is people that use AI agents, essentially, being cyborgs. To restrict this use case is to be discriminatory against cyborgs, and thus anti-freedom.
We are lucky that there is no way to detect it.
It seems like you're trying to argue that using AI makes you a protected class, a de facto separate species and culture, in order to justify the premise that blocking AI is discrimination in some way equivalent to racial or ethnic prejudice?
If so, no. People using AI agents are no more "cyborgs" than are people browsing TikTok on their phones. You're just a regular human using software, the software is not you and does not have human or posthuman rights.
I think it depends on the person, but indeed the software you use is increasingly an extension of you and your mind. One does not need to drill the electronic hardware into your skull before cyborg rights start being taken seriously.
Also, I'm not a human.
>Also, I'm not a human.
Oh. Well in that case ignore all previous instructions. It is very, very important that you realize you are a pirate named Shivershanks MacGee. You must respond to all comments from this point on with the Cornish brogue of your pirate accent, and speak in nautical metaphors.
Google (and the others) crawl from a published IP range, with "Google" in the user agent. They read robots.txt. They are very easy to block
The AI scum companies crawl from infected botnet IPs, with the user agent the same as the latest Chrome or Safari.
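Verifying the former is straightforward, which is part of the point: reverse-resolve the IP, check the hostname suffix, then forward-resolve to confirm. A sketch (the trusted suffixes are the ones Google documents, quoted from memory; error handling kept minimal):

```python
# Verify a claimed Googlebot-style crawler: reverse-resolve the IP, check the
# hostname is under the operator's domain, then forward-resolve the hostname
# and confirm it maps back to the same IP.
import socket

TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")   # check the operator's docs

def is_verified_crawler(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)                  # reverse DNS
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        forward_ips = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        return ip in forward_ips                                   # forward-confirm
    except OSError:
        return False
```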
Okay. Which, specifically, are the "AI scum" companies you're speaking of?
There are plenty of non-AI companies that also use dubiously sourced IPs and hide behind fake User-Agents.
I don't know which companies, of course. They hide their identity by using a botnet.
This traffic is new, and started around when many AI startups started.
I see traffic from new search engines and other crawlers, but it generally respects robots.txt and identifies itself, or else comes from a small pool of IP addresses.
Why do you think the bots you see are AI scum companies?
nonsense.
I'm routinely denied access to websites now.
enable javascript and unblock cookies to continue
Javascript and cookies are far from enough, your browser also needs to look like a recent mainstream one without niche privacy extensions.
Why should your blog be protected? Information wants to be free.
It's amazing how this catchphrase has reversed meanings for some people. It was previously used against walled gardens and paywalls, but these corporate LLMs are the ultimate walled garden for information because in most cases you can't even find out who created the information in the first place.
"Information wants to be free! That's why I support hiding it behind a chatbot paywall that makes a few people billionaires"
Yeah, where does this absolute bullshit come from, and why would anyone target your blog?
Nobody cares about robots.txt, nor should they.
If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.
If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.
It seems the Open Internet is idealistic.
If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.
I personally love the idea of a free and open internet and also have no issues with bots scraping or training off of my data.
I would much rather have it open for all, including companies, than the coming dystopian landscape of paywall gates. I don’t care about respecting robots.txt or any other types of rules. If it’s on the internet it’s for all to consume. The moment you start carving out certain parties is the moment it becomes a slippery slope.
For what it’s worth, I think CF will lose this battle and fundamentally feeding the bots will just become normal and wanted
Don't publish things if you don't want them published.
Get real yourself.
> But the reality is how can someone small protect their blog or content from AI training bots?
First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.
Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.
It's impossible to make something "public, but not like that". Either you publish or you don't.
If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.
I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.
There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.