No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.
It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is that rhinoceroses are simply not going to stroll up and down the aisles and head quietly to the checkout line with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves, they're going to damage goods, and generally they're going to make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".
It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.
This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.
You have a problem with badly behaved scrapers, not AI.
I can't disagree with being against badly behaved scrapers. But this is neither a new problem nor an interesting one from the standpoint of making information freely available to everyone, even rhinoceroses, assuming they are well behaved. Blocking bad actors is not the same thing as blocking AI.
The thing is that rhinoceroses aren't well-behaved. Even if some small fraction of them might in theory be well-behaved, the payoff of trying to account for that is too small to be worth the effort. If 99% of rhinoceroses aren't well-behaved, the simple and correct response is to ban them all, and then maybe the nice ones can ask for a special permit. You switch from allow-by-default to block-by-default.
Similarly it doesn't make sense to talk about what happens if AI bots were well-behaved. If they are, then maybe that would be okay, but they aren't, so we're not talking about some theoretical (or past) situation where bots were well-behaved and scraped in a non-disruptive fashion. We're talking about the present reality in which there actually are enormous numbers of badly-behaved bots.
Incidentally, I see that in a lot of your responses on this thread you keep suggesting that people's problem is "not with AI" but with something else. But look at your comment that I initially replied to:
> Blocking AI training bots is not free and open for all.
We're not talking about "AI". We're talking about AI training bots. If people want to develop AI as a theoretical construct and train it on datasets they download separately in a non-disruptive way, great. (Well, actually it's still terrible, but for other reasons. :-) ) But that's not what people are responding to in this thread. They're talking about AI training bots that scrape websites in a way that is objectively more harmful than previous generations of scrapers.
A rhino can't not be huge and destructive, and humans can't not be shitty and selfish. Badly behaved scrapers are simply an inevitable fact of the universe and there's no point trying to do anything, because it's an immutable law of reality that can never be changed, so don't bother to try.
ISPs are supposed to disconnect abusive customers. The correct thing to do is probably to contact the ISP. Don't complain about the scraping; complain about the DDoS (which is the actual problem and, I'm increasingly beginning to believe, the intent).
Great! How do I get, say, Google's ISP to disconnect them?
Every ISP has an abuse email contact you can look up.
Sure, let me just contact that one ISP located in Russia or India, I am sure they will care a lot about my self-hosted blog
Hence the need for Cloudflare?
I am not comfortable with a private company being the only solution, especially when they have a history of deplatforming sites.
Except that's exactly what you should do. And if they refuse to cooperate you contact the network operators between them and yourself.
Imagine if Chinese or Russian criminal gangs started sending mail bombs to the US/EU and our solution would be to require all senders, including domestic ones, to prove their identity in order to have their parcels delivered. Completely absurd, but somehow with the Internet everyone jumps to that instead of more reasonable solutions.
The internet is not a mirror of the real world.
But many people feel that the very act of incorporating your copyrighted words into their for-profit training set is itself the bad behavior. It's not about rate-limiting scrapers, it's about letting them in the door in the first place.
Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
My Ithaca friends on Facebook complain incessantly about the very existence of AI, to the extent that I would not want to admit that I ask Copilot how to use Windows Narrator, ask Junie where the CSS that makes this text bold lives, or sometimes have Photoshop draw an extra row of bricks in a photograph for me.
The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
They were okay with it when Google was sending them traffic. Now it often doesn't. Google has broken the social contract of the web. So why should the sites whose work is being scraped be expected to continue upholding their end?
Not only are they scraping without sending traffic, they're doing so much more aggressively than Google ever did; Google, at least, respected robots.txt and kept to the same user-agent. They didn't want to index something that a server didn't want indexed. AI bots, on the other hand, want to index every possible thing regardless of what anyone else says.
There's something more obviously nefarious and existential about AI. It takes the idea of "you are the product" to a whole new level.
> Why was it OK for Google to incorporate their words into a for-profit search index which has increasingly sucked all the profit out of the system?
It wasn't okay, it's just that the reasons it wasn't okay didn't become apparent until later.
> The same people seem to have no problem with Facebook using their words for all things Facebook uses them for, however.
Many of those people will likely have a problem with it later, for reasons that are happening now but that they won't become fully aware of until later.
Sure. But we're already talking about a presumption of free and open here. I'm sure people are also reading my words and incorporating them into their own for-profit work. If I cared, I wouldn't make them free and open in the first place.
But that is not something you can protect against with technical means. At best you can block the little fish and hand even more power to the mega corporations, who will always have a way to get to the data: by operating crawlers you cannot afford to block, by incentivizing users to run their browsers and/or extensions that collect the data, or by buying the data from someone who does.
All you end up doing is participating in the enshittification of the web for the rest of us.
Badly behaved scrapers are not a new problem, but badly behaved scrapers run by multibillion-dollar companies that use every possible trick to bypass every block, restriction, or rate limit you put in front of them are a completely new problem, on a scale we've never seen before.
> You have a problem with badly behaved scrapers, not AI.
And you have a problem understanding that "freedom and openness" extend only to where the rights (i.e. the freedom) of another legal entity begin. When I don't want "AI" (not just the badly-behaved subset) rifling through my website, then I should be well within my rights to disallow just that, in the same way as it's your right to allow them access to your playground. It's not rocket science.
This is not what the parent means. What they mean is that such behavior is hypocrisy: you benefit from truly free websites whose owners are interested in having smart chatbots trained on the free web, but you block said chatbots while touting a "free Internet" message.
AI is one of those bad actors
But we should also not throw out the baby with the bathwater. All these attempts at blocking AI bots also block other kinds of crawlers as well as real users with niche browsers.
Meanwhile if you are concerned with the parasitic nature of AI companies then no technical measure will solve that. As you have already noted, they can just buy your data from someone else who you can't afford to block - Google, users with a browser extension that records everything, bots that are ahead of you in the game of cat and mouse, etc.
> "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?"
What this scenario actually reveals is that the words "open to the public" are not intended to mean "access is completely unrestricted".
It's fine to not want to give completely unrestricted access to something. What's not fine, or at least what complicates things unnecessarily, is using words like "open and free" to describe this desired actually-we-do-want-to-impose-certain-unstated-restrictions contract.
I think people use words like "open and free" to describe the actually-restricted contracts they want to have because they're often among like-minded people for whom these unstated additional restrictions are tacitly understood -- or, simply because it sounds good. But for precise communication with a diverse audience, using this kind of language is at best confusing, at worst disingenuous.
Nobody has ever meant "access is completely unrestricted".
As a trivial example: what website is going to welcome DDoS attacks or hacking attempts with open arms? Is a website no longer "open to the public" if it has DDoS protection or a WAF? What if the DDoS makes the website unavailable to the vast majority of users: does blocking the DDoS make it more or less open?
Similarly, if a concert is "open to the public", does that mean they'll be totally fine with you bringing a megaphone and yelling through the performance? Will they be okay with you setting the stage on fire? Will they just stand there and say "aw shucks" if you start blocking other people from entering?
You can try to rules-lawyer your way around commonly-understood definitions, but deliberately and obtusely misinterpreting such phrasing isn't going to lead to any kind of productive discussion.
>You can try to rules-lawyer your way around commonly-understood definitions
Despite your assertions to the contrary, "actually free to use for any purpose" is a commonly understood interpretation of "free to use for any purpose" -- see permissive software licenses, where licensors famously don't get to say "But I didn't mean big companies get to use it for free too!"
The onus is on the person using a term like "free" or "open" to clarify the restrictions they actually intend, if any. Putting the onus anywhere else immediately opens the way for misunderstandings, accidental or otherwise.
To make your concert analogy actually fit: a scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert. They do only the things an ordinary member of the public does; they can't do anything else. The most "damage" they can do is to keep humans who would enjoy the concert from being able to attend if there aren't enough seats; whatever additional costs they cause (air conditioning, let's say) are the same as the costs that would have been incurred by that many humans.
> To make your concert analogy actually fit: A scraper is like a company that sends 1000 robots with tape recorders to your "open to the public" concert.
The scraper is sending ten million robots to your concert. They're packing out every area of space, they're up on the stage, they're in all the vestibules and toilets even though they don't need to go. They've completely crowded out all the humans, who were the ones who actually need to see the concert.
You'd have been fine with a few robots. It used to be the case that companies would send one robot each, and even though they were videotaping, they were discreet about it and didn't get in the humans' way.
Now some imbecile is sending millions of robots instead of just one with a video camera. All the robots wear the scraper's company uniform at first, so to deal with this problem you tell all robots wearing it to go home. Then they all come back dressed identically to the humans in the queue, jumping ahead of them, deliberately disguising who they are because they know you'll kick them out. They're not taking no for an answer, and they're going to use their sheer mass and numbers to crowd out your concert. Nobody seems to know why they do it, and nobody knows for sure who is sending the robots, because the robot owners all deny they're theirs. But somebody is sending them.
Using "open and free" to mean "I actually want no restrictions at all" is also confusing and disingenuous, because, as you yourself point out, a lot of people don't mean that by those words.
The other thing, though, is that there's a difference between "I personally want to release my personal work under open, free, and unrestricted terms" and "I want to release my work into a system that allows people to access information in general under open, free, and unrestricted terms". You can't just look at the individual and say "Oh, well, the conditions you want to put on your content mean it's not open and free so you must not actually want openness and freedom". You have to look at the reality of the entire system. When bots are overloading sites, when information is gated behind paywalls, when junk is firehosed out to everyone on behalf of paid advertisers while actual websites are down on page 20 of the search results, the overall situation is not one of open and free information exchange, and it's naive to think that individuals simply dumping their content "openly and freely" into this environment is going to result in an open and free situation.
Asking people to just unilaterally disarm by imposing no restrictions, while other less noble actors continue to impose all sorts of restrictions, will not produce a result that is free of restrictions. In fact quite the opposite. In order to actually get a free and open world in the large, it's not sufficient for good actors to behave in a free and open manner. Bad actors also must be actively prevented from behaving in an unfree and closed manner. Until they are, one-sided "gifts" of free and open content by the good actors will just feed the misdeeds of the bad actors.
> Asking people to just unilaterally disarm by imposing no restrictions
I'm not asking for this. I'm asking for people who want such restrictions (most of which I consider entirely reasonable) to say so explicitly. It would be enough to replace words like "free" or "open" with "fair use", which immediately signals that some restrictions are intended, without getting bogged down in details.
Why? It seems you already know what people mean by "open and free", and it does have a connection to the ideals of openness and freedom, namely in the systemic context that I described above. So why bother about the terminology?
What people mean by words like "open" and "free" varies. It varies a lot, and a lot turns on what they actually mean.
The only sensible way forward is to be explicit.
Why fight this obvious truth? Why does it hurt so much to say what you mean?
You can always stop bots: add a login/password. But people want their content to be accessible to as large an audience as possible, while at the same time not wanting that data to be accessible to the same audience via other channels. Logic. Bots are not consuming your data; humans are. At the end of the day humans will eventually read it and take action. For example, ChatGPT will mention your site and the user will visit it.
And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
> And no, nothing was different before 2022. Just look at google, the largest bot scraping network in the world. Since 1996.
I'm sorry, but this statement shows you have no recent experience with these crawlernets.
Google, from the beginning, has done their best to work with server owners. They respect robots.txt. I think they were the first to implement Crawl-Delay. They crawl based on how often things change anyway. They have an additional safeguard that when they notice a slowdown in your responses, they back off.
Compare this with Anthropic. On their website they say they follow robots.txt and Crawl-Delay. I have an explicit ban on Claudebot in there and a Crawl-Delay for everyone else. It ignores both. I sent an email to them about this, and their answer didn't address the discrepancy between the docs and the behaviour. They just said they'd add me to their internal whitelist and that I should've sent 429s when they were going too fast. (Fuck off, how about you follow your own public documentation?)
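For context, the setup being ignored here looks something like this in robots.txt (the ten-second delay is just an illustrative value; ClaudeBot is the user-agent token Anthropic documents):

```
# An explicit ban on ClaudeBot, plus a Crawl-Delay for everyone else.
User-agent: ClaudeBot
Disallow: /

User-agent: *
Crawl-delay: 10
```

Note that Crawl-delay was never part of the original robots.txt convention and not every crawler honors it, which is exactly the complaint: the directives only work when the crawler chooses to cooperate.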
That's just my experience, but if you Google around you'll find that Anthropic is notorious for ignoring robots.txt.
And still, Claudebot is one of the better behaved bots. At least they identify themselves, have a support email they respond to, and use identifiable IP-addresses.
A few weeks ago I spent four days figuring out why I had 20x the traffic I normally have (which maxed out the server, causing user complaints). Turns out there are parties that crawl using millions of (residential) IPs while identifying themselves as normal browsers. Only 1 or 2 connections per IP at a time. Randomization of identifying properties. Even Anthropic's 429 solution wouldn't have worked there.
I managed to find a minor identifying property in some of the requests that wasn't catching too many real users. I used that to start firewalling IPs on sight and then their own randomization caused every IP to fall into the trap in the end. But it took days.
In the end I had to firewall nearly 3 million non-consecutive IP addresses.
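As an aside, when a blocklist grows that large, one practical trick is to collapse individual addresses into the smallest covering set of CIDR blocks before feeding them to the firewall, so you need far fewer rules. A minimal sketch using Python's standard library (the addresses are made-up documentation ranges):

```python
import ipaddress

def collapse_blocklist(ips):
    """Collapse individual IPv4 addresses into the smallest set of CIDR
    blocks covering exactly those addresses, shrinking a firewall rule
    list from one-rule-per-IP to one-rule-per-range."""
    nets = (ipaddress.ip_network(ip) for ip in ips)  # each bare IP becomes a /32
    return [str(n) for n in ipaddress.collapse_addresses(nets)]

# Four consecutive addresses collapse into a single /30; the outlier stays a /32.
print(sorted(collapse_blocklist(
    ["203.0.113.0", "203.0.113.1", "203.0.113.2", "203.0.113.3", "198.51.100.7"]
)))
# → ['198.51.100.7/32', '203.0.113.0/30']
```

On real data from a botnet the compression ratio depends entirely on how clustered the addresses are; millions of truly non-consecutive residential IPs, as described above, collapse poorly, which is part of what makes that attack so painful.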
So no, Google in 1996 or 2006 or 2016 is not the same as the modern DDoSing crawlernet.
I am still a bit confused by what some of these crawlers are getting out of it; repeatedly crawling sites that haven't changed seems to be the norm for the current crawlernets, which seems like a massive waste of resources on their end for what is, on average, data of rather indifferent quality.
Nothing. They're not designed to be useful. They're designed to grab as much data as possible and they'll figure out what to do with it later - they don't know it's mostly useless yet.
Tarpits are cool.
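For anyone unfamiliar: a tarpit here is an endpoint that drips a response out as slowly as possible, so a misbehaving crawler ties up its own connection for minutes instead of hammering the next page. A toy sketch of the idea (all names and numbers are made up, not any particular tool's implementation):

```python
import time

def tarpit(total_bytes=1024, delay=1.0, chunk=b"<!-- patience -->\n"):
    """Generate a response one small chunk at a time, sleeping between
    chunks. Streamed to a client, this wastes the crawler's connection
    time while costing the server almost nothing."""
    sent = 0
    while sent < total_bytes:
        yield chunk
        sent += len(chunk)
        time.sleep(delay)  # the whole point: be as slow as tolerable
```

In practice you'd wire a generator like this up as a streaming response on URLs that only misbehaving bots request (e.g. paths disallowed in robots.txt), so real users never hit it.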
Did you send any abuse reports to the ASNs for those IP addresses?