I think that was the point. Everyone loves the dream, but the reality is different.

How so? If you don't want AI bots reading information on the web, you don't actually want a free and open web. The reality of an open web is that such information is free and available for anyone.

> If you don't want AI bots reading information on the web, you don't actually want a free and open web.

This is such a bad faith argument.

We want a town center for the whole community to enjoy! What, you don't like those people shooting up drugs over there? But they're enjoying it too, this is what you wanted right? They're not harming you by doing their drugs. Everyone is enjoying it!

If an AI bot is accessing my site the way that regular users are accessing my site -- in other words everyone is using the town center as intended -- what is the problem?

Seems to be a lot of conflating of badly coded (intentionally or not) scrapers and AI. That is a problem that predates AI's existence.

So if I buy a DDoS service and DDoS your site, is it OK as long as it accesses the site the same way regular people do? I'm sorry for the extreme example; it's obviously not OK, but that's how I understand your position as written.

We can also consider the 10 exploit attempts per second that my site sees.

The issue is that people seem to be conflating badly built scraper bots with AI. If an AI accessed my site as frequently as a normal human (or say Googlebot) then that particular complaint merely goes away. It never had anything to do with AI itself.

Unironically, if we want everyone to enjoy the town center, we should let people do drugs.

Set aside that there's a pretty big difference between AI scraping and illegal drug usage.

If the person using illegal drugs is in no way harming anyone but themselves and not being a nuisance, then yeah, I can get behind that. Put whatever you want in your body, just don't let it negatively impact anyone around you. Seems reasonable?

I think this is actually a good example despite how stark the differences are - both the nuisance AI scrapers and the drug addicts create negative externalities that they could in principle self-regulate but are, for whatever reason, proving unable to, and therefore they cause other people to have a bad time.

Other commenters are voicing the usual “drugs are freedom” type opinions, but having lived in China and Japan, where drugs are dealt with very strictly (and which basically don't have a drug problem today), I can see the other side of the argument: places feeling dirty and dangerous because of drugs - even if you think of addicts sympathetically as victims who need help - make everyone else less free to live the lifestyle they would like to have.

More freedom for one group (whether to ruin their own lives for a high; or to train their AI models) can mean less freedom for others (whether to feel safe walking in public streets; or to publish their little blog on the public internet).

> just don't let it negatively impact anyone around you.

Exactly! Which is why we don't want AI bots siphoning our bandwidth & processing power.

Clearly you don't want the whole community to enjoy it, then. Openness is incompatible with keeping the riff-raff out.

It isn't incompatible at all. You might also be shocked to learn that all you can eat buffets will kick you out if you grab all the food and dump it on your table.

> information is free and available for anyone.

Bots aren't people.

You can want public water fountains without wanting a company attaching a hose to the base to siphon municipal water for corporate use, rendering them unusable for everyone else.

You can want free libraries without companies using their employees' library cards to systematically check out all the books at all times so they don't need to wait if they want to reference one.

> Bots aren't people.

I am though and I get blocked by these bot checks all the time.

Buddha, what makes us human?

That's simple: running up-to-date Chrome with JavaScript enabled does.

I want to be able to enjoy water fountains and libraries without having to show my ID. Somehow we are able to police those via other means, so let's not shit up the web with draconian measures either.

Does allowing bots to access my information prevent other people from accessing my information? No. If it did, you'd have a point and I would be against that. So many strange arguments are being made in this thread.

Ultimately it is the users of AI (and I am one of them) that benefit from that service. I put out a lot of open code and I hope that people are able to make use of it however they can. If that's through AI, go ahead.

> Does allowing bots to access my information prevent other people from accessing my information? No.

Yes it does, that's the entire point.

The flood of AI bots is so bad that (mainly older) servers are literally being overloaded, and (newer) servers have their hosting costs spike so high that it's unaffordable to keep the website alive.

I've had to pull websites offline because badly designed & ban-evading AI scraper bots would run up the bandwidth into the TENS OF TERABYTES, EACH, downloading the same JPEGs every 2-3 minutes in perpetuity. Evidently all that vibe coding isn't doing much good at Anthropic and Perplexity.

Even with my very cheap transfer, this racks up $50-$100/mo in additional costs. If I wanted to use any kind of fanciful "app" hosting it'd be thousands.

I'm still very confused by who is actually benefitting from the bots; from the way they behave it seems like they're wasting enormous amounts of resources on both ends for something that could have been done massively more efficiently.

That's a problem with scrapers, not with AI. I'm not sure why there are way more AI scraper bots now than there were search scraper bots back when that was the new thing. However, that's still an issue of scrapers and rate limiting, and nothing to do with wanting or not wanting AI to read your free and open content.

This whole discussion is about limiting bots and other unwanted agents, not about AI specifically (AI is just an obvious example).

Do the AI training bots provide free access to the distillation of the content they drain from my site repeatedly? Don't they want a free and open web?

I don't feel a particular need to subsidize multi-billion, even trillion, dollar corporations with my content, bandwidth, and server costs, since their genius vibe-coded bots apparently don't know how to use conditional GETs or caching, let alone parse and respect robots.txt.
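
For contrast, here's roughly what polite re-fetching looks like: a minimal Python sketch (the bot name and URL are placeholders, not any real crawler's values) that checks robots.txt first and then uses a conditional GET, so an unchanged file costs the server a 304 status line instead of another full transfer.

    # Minimal sketch of a polite re-fetch (bot name and URL are
    # placeholders): honor robots.txt, then use a conditional GET so an
    # unchanged resource returns 304 Not Modified with no body.
    import urllib.robotparser
    import requests

    AGENT = "ExampleBot/1.0"
    URL = "https://example.com/photos/cat.jpg"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    if rp.can_fetch(AGENT, URL):
        first = requests.get(URL, headers={"User-Agent": AGENT})
        etag = first.headers.get("ETag")  # validator for later re-fetches

        headers = {"User-Agent": AGENT}
        if etag:
            headers["If-None-Match"] = etag
        again = requests.get(URL, headers=headers)
        if again.status_code == 304:
            print("Not modified; reuse the cached copy")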

Is the problem that they exist, or that they are accessing your site badly? There are two issues being conflated here. If humans or robots are causing you issues, as both can do, that's bad. But that has nothing to do with AI in particular.

Problem one is they do not honor the conventions of the web and abuse the sites. Problem two is they are taking content for free, distilling it into a product, and limiting access to that product.

Problem one is not specific to AI and not even about AI.

Problem two is not anything new. Taking freely available content and distilling it into a product is something valuable and potentially worth paying for. People used to buy encyclopedias too. There are countless examples.

Problem one _is_ about AI.

It was a similar problem with cryptocurrencies. Out comes some kind of tech thingy, and a million get-rich-quick scammers pop out of the woodwork and start scamming left, right, and center. Suddenly everyone's in on the hustle: everyone's cryptomining, or taking over computers and using them for cryptomining, setting the world on fire with electricity consumption through the roof, just to fight against other people (whom they wouldn't need to fight if they'd just cooperate).

A vision. A gold rush. A massive increase in shitty human behaviour motivated by greed.

And now here we are again with AI. Massive interest. Trillions of dollars being sloshed around, everyone hustling to develop something so they'll get picked and flooded with cash. An enormous pile of deeply unethical and disrespectful behaviour by people who are doing what they're doing because that's where the money is. The AI bubble.

At present, problem one is almost entirely AI companies.

There's actually not much evidence of this, since the attack traffic is anonymous.

HN people working in these AI companies have commented to say they do this, and the timing correlates with the rise of AI companies/funding.

I haven't tried to find it in my own logs, but others have said blocking an identifiable AI bot soon led to the same pattern of requests continuing through a botnet.

Did HN people present evidence?

And a few decades ago, it would have been search engine scrapers instead.

And that problem was largely solved by robots.txt. AI scrapers are ignoring robots.txt and beating the hell out of sites. Small sites that have decades worth of quality information are suffering the most. Many of the scrapers are taking extreme measures to avoid being blocked, like using large numbers of distinct IP addresses (perhaps using botnets).
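
For anyone who hasn't touched it: robots.txt is just a plain-text file at the site root, and opting out of a well-behaved crawler is trivial. Something like the below is all it takes (GPTBot is OpenAI's published crawler token; Crawl-delay is a nonstandard directive that only some crawlers honor) -- which is exactly why it only works against bots that choose to read it.

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Crawl-delay: 10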

The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".

Freedom, the word, while it implies no boundaries, is always bound by ethics, mutual respect, and the "do no harm" principle. The moment you trip any one of these wires and break it, the mechanisms to counter it become active.

Then we cry "but, freedom?!". Freedom also contains the consequences of one's actions.

Freedom without consequences is tyranny of the powerful.

The problem isn't "AI bot scraping while disregarding all licenses and ethical considerations". The problem is "AI bot scraping while ignoring every good practice to reduce bandwidth usage".

If you ask me, "every good practice to reduce bandwidth usage" falls pretty squarely under ethics, too.

While this is certainly a problem, it's not the only problem.

> The problem is not AI bot scraping, per se, but "AI bot scraping while disregarding all licenses and ethical considerations".

What licenses? Free and open web. Go crazy. What ethical considerations? Do I police how users use the information on my site? No. If they make a pipe bomb using a 6502 CPU and code taken from my website -- am I supposed to do something about that?

Creative Commons, GFDL, Unlicense, GPL/AGPL, MIT, WTFPL. Go crazy. I have the freedom to police how users use the information on my site. Yes.

Real examples: my blog is BY-NC-SA and my digital garden is GFDL. You can't take them, mangle them, and sell them. Especially the blog.

AI companies take these posts and sell derivatives, without any references, consent, or compensation. BY-NC-SA is the complete opposite of what they do.

This is why I'm not uploading any photos I take publicly anymore.

Absolutely. If you want to put all kinds of copyright, license, and even payment restrictions on your content go ahead. And if AI companies or people abuse that, that's bad on them.

But I do think: if you're serious about free and open information, then why are you doing that in the first place? It's perfectly reasonable to be restrictive; I write both very open software and very closed software. But I see a lot of people wanting to straddle the line when it comes to AI without a rational argument.

Let me try to make my point as compact as possible. I may fail, but please bear with me.

I prefer Free Software to Open Source software. My license of choice is A/GPLv3+, because I don't want my work to be used by people/entities in a one-sided way. The software I put out is the software I develop for myself, with the hope of it being useful to somebody else. My digital garden is the same. My blog is a personal diary in the open. These are built in my free time, for myself, and shared.

See, permissive licenses are for "developer freedom". You can do whatever you want with what you grab, as long as you write a line in the credits. The A/GPL family is different. It wants reciprocity. It empowers the user over the developer. You have to give out the source. Whoever modifies the source shares the modifications. It stays in the open. It has to stay open.

I demand this reciprocity for what I put out there. The licenses reflect that. It's "restricting the use to keep the information/code open". I share something I spent my time on, and I want it to live in the open; I want a little respect for putting out what I did. That respect is not fame or superiority. Just don't take it and run with it, keeping all the improvements to yourself.

It's not yours, but ours. You can't keep it to yourself.

When it comes to AI, it's an extension of this thinking. I do not give consent to a faceless corporation to close, twist, and earn money from what I put out for the public good. I don't want a set of corporations acting as middlemen who take what I put out, repackage it (corrupting it in the process), and sell it. It's not about money; it's about ethics, doing the right thing, and being respectful. It's about exploitation. The same applies to my photos.

I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies. I equally get angry when a company's source-available code is scraped and used for suggestions as well as an academic's LGPL high performance matrix library which is developed via grants over the years. These things affect the livelihoods of people.

I get angry when people say "if we take permission for what we do, AI industry will collapse", or "this thing just learns like humans, this is fair use".

I don't buy their "we're doing something awesome, we need no permission" attitude. No, you need permission to use my content. Because I say so. Read the fine print.

I don't want knowledge to be monopolized by these corporations. I don't want the small fish to be eaten by the bigger one and what remains is buried into the depths of information ocean.

This is why I stopped sharing my photos for now, and my latest research won't be open source for quite some time.

What I put out is for humans' direct consumption. Middlemen are not welcome.

If you have any questions, or if I left any holes up there, please let me know.

I respect the desire for reciprocity, but strong copyleft isn't the only, or even the best, way to protect user freedom or public knowledge. My opinion is that permissive licensing and open access to learn from public materials have created enormous value precisely because they don't pre-empt future uses. Requiring permission for every new kind of reuse (including ML training) shrinks the commons, entrenches incumbents who already have data deals, and reduces the impact of your work. The answer to exploitation is transparency, attribution, and guardrails against republication, not copyright-enforced restrictions.

I used to be much more into the GPL than I am now. Perhaps it was much more necessary decades ago, or perhaps our fears were misguided. I license all my own stuff as Apache. If companies want to use it, great. It doesn't diminish what I've done. But as for those who prefer the GPL, I completely understand.

> as well as an academic's LGPL high performance matrix library which is developed via grants over the years.

The academic got paid with grants. So now this high performance library exists in the world, paid for by taxes, but it can't be used everywhere. Why is it bad to share this with everyone for any purpose?

> What I put out is for humans' direct consumption. Middlemen are not welcome.

Why? Why must it be direct consumption? I've used AI tools to accomplish things that I wouldn't be able to do on my own in my free time -- work that is now open source. Tons of developers this week are benefiting from what I was able to accomplish using a middleman. Not all middlemen, by definition, are bad. Middlemen can provide value. Why is that value not welcome?

> I'm not against AI/LLM/Generative technology/etc. I'm against exploitation of people, artists, musicians, software developers, other companies.

If you define AI/LLM/Generative technology/etc as the exploitation of people, artists, musicians, software developers, other companies, then you are against it. As software developers, our work directly affects the livelihoods of people. Everything we create is meant to automate some human task. To be a software developer and then complain that AI is going to take away jobs is to be a hypocrite.

Your whole argument is easily addressed by requiring the AI models to be open source. That way, they obviously respect the AGPL and any other open license, and contribute to keeping the information free. Letting these companies knowingly and obviously infringe licenses and all copyright, as they do today, is obviously immoral and illegal.

AGPL doesn't pre-empt future uses or require permission for any kind of re-use. You just have to share alike. It's pretty simple.

AGPL lets you take a bunch of data and AI-train on it. You just have to release the data and source code to anyone who uses the model. Pretty simple. You don't have to rent them a bunch of GPUs.

Actually it can be annoying because of the specific mechanism by which you have to share alike - the program has to have a link to its own source code - you can't just offer the source alongside the binary. But it's doable.

How is it available for everyone if the AI bots bring down your server?

Is that really the problem we are discussing? I've had people attack my server and bring it down. But that has nothing to do with being free and open to everyone. A top Hacker News post could take my server down.

Yes, because a top Hacker News post takes your server down when a large number of actual humans are looking to gain actual value from your posts. Meanwhile, you stand to benefit from the HN discussion by learning new things and perspectives from the community.

The AI bot assault, on the other hand, is one company (or a few companies) re-fetching the same data over and over again, constantly, in perpetuity, just in case it's changed, all so they can incorporate it into their training set and make money off of it while giving you zero credit and providing zero feedback.

But then we get to use those AI tools.

The refrain here comes down not to "AI" but mostly to "the AI bot assault", which is a different thing. Sure, let's have a discussion about badly behaved and overzealous web scrapers. As for credit, I've asked AI for its references and gotten them. If my information is merely mushed into an AI training model, I'm not sure why I need credit. If you discuss this thread with your friends, are you going to give me credit?

No, you don't "get to" use the AI tools. You have to buy access to them (beyond some free trials).

Yes. I get to buy access to them. They're providing an expensive to provide service that requires specialized expertise. I don't see the problem with that.

"If you discuss this thread with your friends are you going to give me credit?"

Yes. How else would I enable my friends to look it up for themselves?

Six months from now, when you've internalized this entire thread, are you even going to remember where you got it from?

Why are you shifting the discussion by adding two new variables (time/memory)?

Because that's how one interacts with AI.

Yeah. Running out of arguments, are you?


Ultimately, you have to realize that this is a losing battle, unless we have completely draconian control over every piece of silicon. Captchas are being defeated; at this point they're basically just mechanisms to prove you Really Want to Make That Request to the extent that you'll spend some compute time on it, which is starting to become a bit of a waste of electricity and carbon.
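
To make the "spend some compute time" point concrete: those challenges are basically hashcash. Here's a toy Python sketch (the difficulty number and names are made up; real schemes differ in detail) -- the client grinds through hashes to find a nonce, while the server verifies with a single hash:

    # Toy hashcash-style proof-of-work: find a nonce whose hash of
    # (challenge + nonce) starts with DIFFICULTY zero bits. Numbers and
    # names here are illustrative, not any real vendor's scheme.
    import hashlib
    import os

    DIFFICULTY = 18  # leading zero bits required; tunes the CPU cost

    def clears_bar(challenge: bytes, nonce: int) -> bool:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: bytes) -> int:
        """Client side: ~2**DIFFICULTY hashes of grinding, on average."""
        nonce = 0
        while not clears_bar(challenge, nonce):
            nonce += 1
        return nonce

    challenge = os.urandom(16)           # server issues a random challenge
    nonce = solve(challenge)             # client burns CPU
    assert clears_bar(challenge, nonce)  # server verifies with one hash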

Talented people that want to scrape or bot things are going to find ways to make that look human. If that comes in the form of tricking a physical iPhone by automatically driving the screen physically, so be it; many such cases already!

The techniques you need for preventing DDoS don't need to really differentiate that much between bots and people unless you're being distinctly targeted; Fail2Ban-style IP bans are still quite effective, and basic WAF functionality does a lot.
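
For reference, the core of a Fail2Ban-style ban is just a sliding window of timestamps per IP. A toy sketch with made-up thresholds (the real tool tails log files and inserts firewall rules rather than sitting inside your app):

    # Toy Fail2Ban-style limiter: ban any IP that exceeds MAX_HITS
    # requests in a WINDOW-second sliding window. Thresholds are made up.
    import time
    from collections import defaultdict, deque

    WINDOW = 10.0      # seconds per sliding window
    MAX_HITS = 50      # requests allowed per window, per IP
    BAN_SECS = 3600.0  # how long a ban lasts

    hits = defaultdict(deque)  # ip -> recent request timestamps
    bans = {}                  # ip -> time the ban expires

    def allow(ip: str) -> bool:
        now = time.monotonic()
        if bans.get(ip, 0.0) > now:
            return False           # still banned
        q = hits[ip]
        q.append(now)
        while q and q[0] < now - WINDOW:
            q.popleft()            # forget hits outside the window
        if len(q) > MAX_HITS:
            bans[ip] = now + BAN_SECS
            return False
        return True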

Agreed, copyright issues need to be solved via legislation and network abuse issues need to be solved by network operators. Trying to run around either only makes the web worse for everyone.

Rate-limits? Use a CDN? Lots of traffic can be a problem whether it's bots or humans.

You realize this entire thread is about a pitch from a CDN company trying to solve an issue that has presented itself at such a scale that this is the best option they can think of to keep the web alive, right?

"Use a CDN" is not sufficient when these bots are so incredibly poorly behaved, because you're still paying for that CDN and this bad behavior is going to cost you a fortune in CDN costs (or cost the CDN a fortune instead, which is why Cloudflare is suggesting this).

Everyone can get it from the bots now?

Build better
