> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?
Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is that rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the perspective of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them in theory might be well-behaved, that fraction is too small to be worth the effort of accounting for. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
A rhino can't not be huge and destructive, and humans can't not be shitty and selfish. Badly behaved scrapers are simply an inevitable fact of the universe, and there's no point trying to do anything because it's an immutable law of reality and can never be changed, so don't bother to try.
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably contact the ISP. Don't complain about the scraping, complain about the DDoS (which is the actual problem and, I'm increasingly beginning to believe, the intent).
Great! How do I get, say, Google's ISP to disconnect them?
Every ISP has an abuse email contact you can look up.
Sure, let me just contact that one ISP located in Russia or India, I am sure they will care a lot about my self-hosted blog
Hence the need for Cloudflare?
I am not comfortable with a private company being the only solution, especially when they have a history of deplatforming sites.
Except that's exactly what you should do. And if they refuse to cooperate you contact the network operators between them and yourself.
Imagine if Chinese or Russian criminal gangs started sending mail bombs to the US/EU and our solution were to require all senders, including domestic ones, to prove their identity in order to have their parcels delivered. Completely absurd, but somehow with the Internet everyone jumps to that instead of more reasonable solutions.
The internet is not a mirror of the real world.
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers, it's about letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I would not want to admit that I ask Copilot how to use Windows Narrator, ask Junie where the CSS that makes this text bold lives, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
They were okay with it when Google was sending them traffic. Now it often doesn't. Google has broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
There's something more obviously nefarious and existential about AI. It takes the idea of "you are the product" to a whole new level.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it eventually, for reasons that are unfolding now but that they won't become fully aware of until later.
Sure. But we're already talking about presumption of free and open here. I'm sure people are also reading my words and incorporating it into their own for-profit work. If I cared, I wouldn't make it free and open in the first place.
But that is not something you can protect against with technical means. At best you can block the little fish and give even more power to the mega corporations, who will always have a way to get to the data - either by operating crawlers you cannot afford to block, by incentivizing users to run browsers and/or extensions that collect the data, or by buying the data from someone who does.
All you end up doing is participating in the enshittification of the web for the rest of us.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies which use every possible trick to bypass every block or restriction or rate limit you put in front of them are a completely new problem, on a scale we've never seen before.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (i.e. the freedom) of another legal entity begin. When I don't want "AI" (not just the badly-behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way as it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you are getting access to truly free websites whose owners are interested in having smart chatbots trained on the free web, but you are blocking said chatbots while touting a "free Internet" message.
AI is one of those bad actors
But we should also not throw out the baby with the bathwater. All these attempts at blocking AI bots also block other kinds of crawlers as well as real users with niche browsers.
Meanwhile if you are concerned with the parasitic nature of AI companies then no technical measure will solve that. As you have already noted, they can just buy your data from someone else who you can't afford to block - Google, users with a browser extension that records everything, bots that are ahead of you in the game of cat and mouse, etc.
> "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?"
What this scenario actually reveals is that the words "open to the public" are not intended to mean "access is completely unrestricted".
It's fine to not want to give completely unrestricted access to something. What's not fine, or at least what complicates things unnecessarily, is using words like "open and free" to describe this desired actually-we-do-want-to-impose-certain-unstated-restrictions contract.
I think people use words like "open and free" to describe the actually-restricted contracts they want to have because they're often among like-minded people for whom these unstated additional restrictions are tacitly understood -- or, simply because it sounds good. But for precise communication with a diverse audience, using this kind of language is at best confusing, at worst disingenuous.
Nobody has ever meant "access is completely unrestricted".
As a trivial example: what website is going to welcome DDoS attacks or hacking attempts with open arms? Is a website no longer "open to the public" if it has DDoS protection or a WAF? What if the DDoS makes the website unavailable to the vast majority of users: does blocking the DDoS make it more or less open?
Similarly, if a concert is "open to the public", does that mean they'll be totally fine with you bringing a megaphone and yelling through the performance? Will they be okay with you setting the stage on fire? Will they just stand there and say "aw shucks" if you start blocking other people from entering?
You can try to rules-lawyer your way around commonly-understood definitions, but deliberately and obtusely misinterpreting such phrasing isn't going to lead to any kind of productive discussion.
>You can try to rules-lawyer your way around commonly-understood definitions
Despite your assertions to the contrary, "actually free to use for any purpose" is a commonly understood interpretation of "free to use for any purpose" -- see permissive software licenses, where licensors famously don't get to say "But I didn't mean big companies get to use it for free too!"
The onus is on the person using a term like "free" or "open" to clarify the restrictions they actually intend, if any. Putting the onus anywhere else immediately opens the way for misunderstandings, accidental or otherwise.
To make your concert analogy actually fit: A scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert. They do only the things an ordinary member of the public does; they can't do anything else. The most "damage" they can do is to keep humans who would enjoy the concert from being able to attend if there aren't enough seats; whatever additional costs they cause (air conditioning, let's say) are the same as the costs that would have been incurred by that many humans.
> To make your concert analogy actually fit: A scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert.
The scraper is sending ten million robots to your concert. They're packing out every area of space, they're up on the stage, they're in all the vestibules and toilets even though they don't need to go. They've completely crowded out all the humans, who were the ones who actually wanted to see the concert.
You'd have been fine with a few robots. It used to be the case that companies would send one robot each, and even though they were videotaping, they were discreet about it and didn't get in the humans' way.
Now some imbecile is sending millions of robots, instead of just one with a video camera. All the robots wear the scraper's company uniform at first, so to deal with this problem you tell all robots wearing it to go home. Then they all come back dressed identically to the humans in the queue, jumping ahead of them, deliberately disguising who they are because they know you'll kick them out. They're not taking no for an answer, and they're going to use their sheer mass and numbers to block out your concert. Nobody seems to know why they do it, and nobody knows for sure who is sending them, because every robot owner denies the robots are theirs. But somebody is sending them.
Using "open and free" to mean "I actually want no restrictions at all" is also confusing and disingenuous, because, as you yourself point out, a lot of people don't mean that by those words.
The other thing, though, is that there's a difference between "I personally want to release my personal work under open, free, and unrestricted terms" and "I want to release my work into a system that allows people to access information in general under open, free, and unrestricted terms". You can't just look at the individual and say "Oh, well, the conditions you want to put on your content mean it's not open and free so you must not actually want openness and freedom". You have to look at the reality of the entire system. When bots are overloading sites, when information is gated behind paywalls, when junk is firehosed out to everyone on behalf of paid advertisers while actual websites are down on page 20 of the search results, the overall situation is not one of open and free information exchange, and it's naive to think that individuals simply dumping their content "openly and freely" into this environment is going to result in an open and free situation.
Asking people to just unilaterally disarm by imposing no restrictions, while other less noble actors continue to impose all sorts of restrictions, will not produce a result that is free of restrictions. In fact quite the opposite. In order to actually get a free and open world in the large, it's not sufficient for good actors to behave in a free and open manner. Bad actors also must be actively prevented from behaving in an unfree and closed manner. Until they are, one-sided "gifts" of free and open content by the good actors will just feed the misdeeds of the bad actors.
> Asking people to just unilaterally disarm by imposing no restrictions
I'm not asking for this. I'm asking for people who want such restrictions (most of which I consider entirely reasonable) to say so explicitly. It would be enough to replace words like "free" or "open" with "fair use", which immediately signals that some restrictions are intended, without getting bogged down in details.
Why? It seems you already know what people mean by "open and free", and it does have a connection to the ideals of openness and freedom, namely in the systemic context that I described above. So why bother about the terminology?
What people mean by words like "open" and "free" varies. It varies a lot, and a lot turns on what they actually mean.
The only sensible way forward is to be explicit.
Why fight this obvious truth? Why does it hurt so much to say what you mean?
You can always stop bots: add a login/password. But people want their content to be accessible to as large an audience as possible, while at the same time they don't want that data to be accessible to the same audience via other channels. Logic. Bots are not consuming your data - humans are. At the end of the day humans will eventually read it and take action. For example, ChatGPT will mention your site and the user will visit it.
And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things change anyway. They have an additional safeguard that when they notice a slowdown in your responses, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent them an email about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'll add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your public documentation?)
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
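Honouring both directives takes almost nothing, which is what makes ignoring them so galling. Here's a minimal sketch of a compliant crawler using only Python's standard library; the site, paths, and bot name are placeholders I made up, not anyone's real crawler:

```python
import time
import urllib.robotparser
import urllib.request

SITE = "https://example.com"   # placeholder site
BOT = "ExampleBot"             # placeholder crawler name

rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

# Honour Crawl-Delay if the site sets one; otherwise still be conservative.
delay = rp.crawl_delay(BOT) or 10

for path in ("/", "/blog/", "/about/"):   # hypothetical crawl frontier
    url = SITE + path
    if not rp.can_fetch(BOT, url):        # robots.txt says no: skip it
        continue
    req = urllib.request.Request(url, headers={"User-Agent": BOT})
    with urllib.request.urlopen(req) as resp:
        resp.read()
    time.sleep(delay)                     # space requests out per Crawl-Delay
```

That's roughly all it takes to qualify as well behaved, which makes the current behaviour a choice, not a technical limitation.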
And still, Claudebot is one of the better behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP-addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). Turns out there are parties that crawl using millions of (residential) IPs and identify themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that wasn't catching too many real users. I used that to start firewalling IPs on sight and then their own randomization caused every IP to fall into the trap in the end. But it took days.
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
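For anyone stuck doing the same cleanup: one trick that helps at that scale is collapsing the individual addresses into CIDR blocks before feeding them to the firewall, so you're not managing millions of single-IP rules. A rough sketch with Python's standard library (the input file name and the ipset name are invented for the example):

```python
import ipaddress

# Hypothetical input: one offending IPv4 address per line, e.g. pulled from access logs.
with open("scraper_ips.txt") as f:
    nets = [ipaddress.ip_network(line.strip()) for line in f if line.strip()]

# Collapse adjacent /32s into the smallest covering set of CIDR blocks,
# so the firewall holds far fewer entries than one rule per address.
collapsed = ipaddress.collapse_addresses(nets)

for net in collapsed:
    # Emit whatever your firewall wants; these ipset commands assume a
    # pre-created hash:net set named "scraperblock".
    print(f"ipset add scraperblock {net}")
```

It won't stop the arms race, but it keeps the rule count sane.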
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
I am still a bit confused by what some of these crawlers are getting out of it; repeatedly crawling sites that haven't changed seems to be the norm for the current crawlernets, which seems like a massive waste of resources on their end for what is, on average, data of rather indifferent quality.
Nothing. They're not designed to be useful. They're designed to grab as much data as possible and they'll figure out what to do with it later - they don't know it's mostly useless yet.
Tarpits are cool.
Did you send any abuse reports to the ASNs for those IP addresses?
They're basically describing the tragedy of the commons, but where a handful of people have bulldozers to rip up all the grass and trees.
We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.
Bingo. Thanks for clarifying exactly my point
That's a very "BSD is freedom and GPL isn't" kind of philosophy.
Nothing is truly free unless you give equal respect to fellow hobbyists and megacorps using your labor for their profit.
GPL doesn't care if you use it for profit or not (good), it just says that the resultant model needs to be open too. And open models exist in droves nowadays. Even closed models can be distilled into open ones.
>You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.
It's perfectly legit to want to have a "free and open for all except big corporations and AI engines".
I think that was the point. Everyone loves the dream, but the reality is different.
How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.
> If you don't want AI bots reading information on the web, you don't actually want a free and open web.
This is such a bad faith argument.
We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!
If an AI bot is accessing my site the way that regular users are accessing my site -- in other words everyone is using the town center as intended -- what is the problem?
Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.
So if I buy a DDoS service and DDoS your site, it's ok as long as it accesses it the same way regular people do? I'm sorry for the extreme example - it's obviously not ok - but that's how I understand your position as written.
We can also consider 10 exploit attempts per second that my site sees.
The issue is that people seem to be conflating badly built scraper bots with AI. If an AI accessed my site as frequently as a normal human (or say Googlebot) then that particular complaint merely goes away. It never had anything to do with AI itself.
Unironically, if we want everyone to enjoy the town center, we should let people do drugs.
Let's set aside that there's a pretty big difference between AI scraping and illegal drug usage.
If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?
I think this is actually a good example despite how stark the differences are - both the nuisance AI scrapers and the drug addicts create negative externalities that they could in principle self-regulate but, for whatever reason, are proving unable to, and so they cause other people to have a bad time.
Other commenters are offering the usual "drugs are freedom" type opinions, but having now lived in China and Japan, where drugs are dealt with very strictly (and which basically don't have a drug problem today), I can see the other side of the argument: places feeling dirty and dangerous because of drugs - even if you think of addicts sympathetically as victims who need help - makes everyone else less free to live the lifestyle they would like to have.
More freedom for one group (whether to ruin their own lives for a high; or to train their AI models) can mean less freedom for others (whether to not feel safe walking in public streets; or to publish their little blog in the public internet).
> just don't let it negatively impact anyone around you.
Exactly! Which is why we don't want AI bots siphoning our bandwidth & processing power.
Clearly you don't want the whole community to enjoy it then. Openness is incompatible with keeping the riff raff out
It isn't incompatible at all. You might also be shocked to learn that all you can eat buffets will kick you out if you grab all the food and dump it on your table.
> information is free and available for anyone.
Bots aren't people.
You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.
You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.
> Bots aren't people.
I am though and I get blocked by these bot checks all the time.
Buddha, what makes us human?
That's simple: running up-to-date Chrome with JavaScript enabled does.
I want to be able to enjoy water fountains and libraries without having to show my ID. Somehow we are able to police those via other means, so let's not shit up the web with draconian measures either.
Does allowing bots to access my information prevent other people from accessing my information? No. If it did, you'd have a point and I would be against that. So many strange arguments are being made in this thread.
Ultimately it is the users of AI (and I am one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.
> Does allowing bots to access my information prevent other people from accessing my information? No.
Yes it does, that's the entire point.
The flood of AI bots is so bad that (mainly older) servers are literally being overloaded and (newer servers) have their hosting costs spike so high that it's unaffordable to keep the website alive.
I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH. Downloading the same jpegs every 2-3 minutes into perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.
Even with my very cheap transfer, that racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.
I'm still very confused by who is actually benefitting from the bots; from the way they behave it seems like they're wasting enormous amounts of resources on both ends for something that could have been done massively more efficiently.
That's a problem with scrapers, not with AI. I'm not sure why there are way more AI scraper bots now than there were search scraper bots back when that was the new thing. However that's still an issue of scrapers and rate limiting and has nothing to do with wanting or not wanting AI to read your free and open content.
This whole discussion is about limiting bots and other unwanted agents, not about AI specifically (AI was just an obvious example)
Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?
I don't feel a particular need to subsidize multibillion- or even trillion-dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don't know how to use modified-GETs or caching, let alone parse and respect robots.txt.
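(For anyone unfamiliar with the term: a modified/conditional GET just sends If-None-Match / If-Modified-Since and lets the server answer 304 when nothing has changed. A minimal client-side sketch, with a placeholder URL and bot name, might look like this:)

```python
import urllib.request
import urllib.error

URL = "https://example.com/page.html"   # placeholder resource
UA = "ExampleBot"                        # placeholder bot name

def fetch_if_changed(etag=None, last_modified=None):
    """Conditional GET: only download the body if the server says it changed."""
    req = urllib.request.Request(URL, headers={"User-Agent": UA})
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            # 200: content changed; remember the new validators for next time.
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:                 # 304 Not Modified: almost no bandwidth used
            return None, etag, last_modified
        raise

body, etag, modified = fetch_if_changed()                 # first fetch: full download
body, etag, modified = fetch_if_changed(etag, modified)   # repeat fetch: usually a 304
```

Downloading the same unchanged jpegs every few minutes means they aren't even doing this much.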
Is the problem that they exist or that they are badly accessing your site? Because there are two conflated issues here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.
Problem one is they do not honor the conventions of the web and abuse the sites. Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.
Problem one is not specific to AI and not even about AI.
Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.
Problem one _is_ about AI.
It was a similar problem with cryptocurrencies. Out comes some kind of tech thingy, and a million get-rich-quick scammers pop out of the woodwork and start scamming left, right and center. Suddenly everyone's in on the hustle, everyone's cryptomining, or taking over computers and using them for cryptomining, they're setting the world on fire with electricity consumption through the roof just to fight against other people (who they wouldn't need to fight against if they'd just cooperate).
A vision. A gold rush. A massive increase in shitty human behaviour motivated by greed.
And now here we are again with AI. Massive interest. Trillions of dollars being sloshed around, everyone hustling to develop something so they'll get picked and flooded with cash. An enormous pile of deeply unethical and disrespectful behaviour by people who are doing what they're doing because that's where the money is. The AI bubble.
At present, problem one is almost entirely AI companies.
There's actually not much evidence of this, since the attack traffic is anonymous.
HN people working in these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.
I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.
Did HN people present evidence?
And a few decades ago, it would have been search engine scrapers instead.
And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
The word "freedom", while it implies no boundaries, is always bound by ethics, mutual respect, and the "do no harm" principle. The moment you trip any one of these wires and break it, the mechanisms to counter it become active.
Then we cry "but, freedom?!". Freedom also includes the consequences of one's actions.
Freedom without consequences is tyranny of the powerful.
The problem isn't "AI bot scraping while disregarding all licenses and ethical considerations". The problem is "AI bot scraping while ignoring every good practice to reduce bandwidth usage".
If you ask me "every good practice to reduce bandwidth usage" falls under ethics pretty squarely, too.
While this is certainly a problem, it's not the only problem.
> The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".
What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU with code taken from my website -- am I supposed to do something about that?
Creative Commons, GFDL, Unlicense, GPL/AGPL, MIT, WTFPL. Go crazy. I have the freedom to police how users use the information on my site. Yes.
Real examples: my blog is BY-NC-SA and my digital garden is GFDL. You can't take them, mangle them and sell them. Especially the blog.
AI companies take these posts, and sell derivatives, without any references, consent or compensation. BY-NC-SA is complete opposite of what they do.
This is why I'm not uploading any photos I take publicly anymore.
Absolutely. If you want to put all kinds of copyright, license, and even payment restrictions on your content go ahead. And if AI companies or people abuse that, that's bad on them.
But I do think that if you're serious about free and open information, it's worth asking why you're imposing those restrictions in the first place. It's perfectly reasonable to be restrictive; I write both very open software and very closed software. But I see a lot of people wanting to straddle the line when it comes to AI without a rational argument.
Let me try to make my point as compact as possible. I may fail, but please bear with me.
I prefer Free Software to Open Source software. My license of choice is A/GPLv3+. Because, I don't want my work to be used by people/entities in a single sided way. The software I put out is the software I develop for myself, with the hope of being useful for somebody else. My digital garden is the same. My blog is a personal diary in the open. These are built on my free time, for myself, and shared.
See, permissive licenses are for "developer freedom". You can do whatever you want with what you can grab, as long as you add a line to the credits. The A/GPL family is different. It wants reciprocity. It empowers the user over the developer. You have to give the source. Whoever modifies the source shares the modifications. It stays in the open. It has to stay open.
I demand this reciprocity for what I put out there. The licenses reflect that. It's "restricting the use to keep the information/code open". I share something I spent my time on, I want it to live in the open, and I want a little respect for putting out what I did. That respect is not fame or superiority. Just don't take it and run with it, keeping all the improvements to yourself.
It's not yours, but ours. You can't keep it to yourself.
When it comes to AI, it's an extension of this thinking. I do not give consent for a faceless corporation to close, twist and earn money from what I put out for the public good. I don't want a set of corporations acting as middlemen to take what I put out, repackage and corrupt it in the process, and sell it. It's not about money; it's about ethics, doing the right thing and being respectful. It's about exploitation. The same applies to my photos.
I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies. I equally get angry when a company's source-available code is scraped and used for suggestions as well as an academic's LGPL high performance matrix library which is developed via grants over the years. These things affect people's livelihoods.
I get angry when people say "if we take permission for what we do, AI industry will collapse", or "this thing just learns like humans, this is fair use".
I don't buy their "we're doing something awesome, we need no permission" attitude. No, you need permission to use my content. Because I say so. Read the fine print.
I don't want knowledge to be monopolized by these corporations. I don't want the small fish to be eaten by the bigger ones, with what remains buried in the depths of the information ocean.
This is why I stopped sharing my photos for now, and my latest research won't be open source for quite some time.
What I put out is for humans' direct consumption. Middlemen are not welcome.
If you have any questions, or if I've left any holes up there, please let me know.
I respect the desire for reciprocity, but strong copyleft isn't the only, or even the best, way to protect user freedom or public knowledge. My opinion is that permissive licensing and open access to learn from public materials have created enormous value precisely because they don't pre-empt future uses. Requiring permission for every new kind of reuse (including ML training) shrinks the commons, entrenches incumbents who already have data deals, and reduces the impact of your work. The answer to exploitation is transparency, attribution, and guardrails against republication, not copyright enforced restrictions.
I used to be much more into the GPL than I am now. Perhaps it was much more necessary decades ago or perhaps our fears were misguided. I license all my own stuff as Apache. If companies want to use it, great. It doesn't diminish what I've done. But those who prefer GPL, I completely understand.
> as well as an academic's LGPL high performance matrix library which is developed via grants over the years.
The academic got paid with grants. So now this high performance library exists in the world, paid for by taxes, but it can't be used everywhere. Why is it bad to share this with everyone for any purpose?
> What I put out is for humans' direct consumption. Middlemen are not welcome.
Why? Why must it be direct consumption? I've used AI tools to accomplish things that I wouldn't be able to do on my own in my free time -- work that is now open source. Tons of developers this week are benefiting from what I was able to accomplish using a middleman. Not all middlemen, by definition, are bad. Middlemen can provide value. Why is that value not welcome?
> I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies.
If you define AI/LLM/Generative technology/etc as the exploitation of people, artists, musicians, software developers, and other companies, then you are against it. As software developers our work directly affects the livelihoods of people. Everything we create is meant to automate some human task. To be a software developer and then complain that AI is going to take away jobs is to be a hypocrite.
Your whole argument is easily addressed by requiring the AI models to be open source. That way, they obviously respect the AGPL and any other open license, and contribute to the information being kept free. Letting these companies knowingly and obviously infringe licenses and all copyright as they do today is obviously immoral, and illegal.
AGPL doesn't pre-empt future uses or require permission for any kind of re-use. You just have to share alike. It's pretty simple.
AGPL lets you take a bunch of data and AI-train on it. You just have to release the data and source code to anyone who uses the model. Pretty simple. You don't have to rent them a bunch of GPUs.
Actually it can be annoying because of the specific mechanism by which you have to share alike - the program has to have a link to its own source code - you can't just offer the source alongside the binary. But it's doable.
How is it available for everyone if the AI bots bring down your server?
Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top Hacker News post could take my server down.
Yes, because a top hacker news post takes your server down because a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.
The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.
But then we get to use those AI tools.
The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?
No, you don't "get to" use the AI tools. You have to buy access to them (beyond some free trials).
Yes. I get to buy access to them. They're providing an expensive to provide service that requires specialized expertise. I don't see the problem with that.
"If you discuss this thread with your friends are you going to give me credit?"
Yes. How else would I enable my friends to look it up for themselves?
Six months from now, when you've internalized this entire thread, are you even going to remember where you got it from?
Why are you shifting the discussion by adding two new variables (time/memory)?
Because that's how one interacts with AI.
Yeah. Running out of arguments, are you?
Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.
Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!
The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
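The core of a Fail2Ban-style ban really is small. Here's a toy in-process sketch (the thresholds are invented, and a real deployment would push the ban down into the firewall rather than keep it in the application):

```python
import time
from collections import defaultdict, deque

WINDOW = 60         # seconds of history to keep per IP
MAX_REQUESTS = 120  # allowed requests per IP per window (made-up threshold)
BAN_SECONDS = 3600

hits = defaultdict(deque)   # ip -> timestamps of recent requests
banned_until = {}           # ip -> time the ban expires

def allow(ip: str) -> bool:
    """Return False if this IP is banned or has just exceeded its budget."""
    now = time.monotonic()
    if banned_until.get(ip, 0) > now:
        return False
    q = hits[ip]
    q.append(now)
    while q and now - q[0] > WINDOW:    # drop requests outside the window
        q.popleft()
    if len(q) > MAX_REQUESTS:
        banned_until[ip] = now + BAN_SECONDS
        return False
    return True
```

Of course, the residential-proxy crawlers described above defeat exactly this by spreading requests across millions of IPs, which is the whole problem.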
Agreed, copyright issues need to be solved via legislation and network abuse issues need to be solved by network operators. Trying to run around either only makes the web worse for everyone.
Rate-limits? Use a CDN? Lots of traffic can be a problem whether it's bots or humans.
You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?
"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).
Everyone can get it from the bots now?
Build better
Nothing is "free". AI bots eat up my blog like crazy and I have to pay for its hosting.
Don't you have rate-limits? And how much are you paying for the instance where you're hosting it? I've run/helped run projects with something like ~10 req/s easily on $10 VPSs, surely hosting HTML can't cost you that much?
Of course it won't be free, but you can get pretty close to free by employing the typical things you'd put in place to restrict the amount of resources used, like rate-limits, caches and so on.
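Even the caching half is cheap to do yourself. A toy sketch of serving static HTML with an ETag and Cache-Control, so repeat fetches from browsers, CDNs, or polite bots cost next to nothing (the page content and max-age are arbitrary, and a real setup would sit behind nginx or similar):

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>my blog post</body></html>"
ETAG = '"%s"' % hashlib.sha1(PAGE).hexdigest()

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Client (or CDN) already has this exact version: serve ~0 bytes.
        if self.headers.get("If-None-Match") == ETAG:
            self.send_response(304)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("ETag", ETAG)
        self.send_header("Cache-Control", "public, max-age=3600")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("", 8080), Handler).serve_forever()
```

The catch, as others note, is that none of this helps against bots that re-download unchanged content on purpose and ignore the caching headers entirely.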
And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?
Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
> And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.
Where are people getting this from? No, Cloudflare or any other CDN is not required for you to host your own stuff. Sure, it's easy, and probably the best way to go if you just wanna focus on shipping, but let's not pretend it's a requirement today.
> Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point
I don't think they are, that's why we have rate limiters, right? :) I think the point is that if you're allowing a user to access some content in one way, why not allow that same user to access the content in the same way, but using a different user-agent? That's the original purpose of that header after all, to signal what the user used as an agent on their behalf. Commonly, I use Firefox as my agent for browsing, but I should be free to use any user-agent, if we want the web to remain open and free.
My point is that people choosing to outsource the complexity of running a rate limiter and blocking bad actors to Cloudflare and others like them is not the issue you make it out to be.
Why is it good for me to do it myself but bad to pay Cloudflare $20 a month to do it for me? No one is forcing me to use their services. I still have the option to do it myself, or use someone else, or not use anything at all. Seems pretty free to me.
Many AI scraping bots are notoriously bad actors and are hammering sites. Please don’t pretend they are all or even mostly well behaved. We didn’t have this push with the search engine scraping bots as those were mostly well behaved.
You are setting up a straw man with a "hey, why not let this hypothetical well-behaved bot in". That isn't the argument or the reality. We didn't have the need to block Google, Yahoo, or Bing's bots because they respected robots.txt and had a reasonable frequency of visits.