I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."
robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch. It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center. Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
> robots.txt is a polite request to please not scrape these pages
People who ignore polite requests are assholes, and we are well within our rights to complain about them.
I agree that "theft" is too strong (though I think you might be presenting a straw man there), but "abuse" can be perfectly apt: a crawler hammering a server, requesting the same pages over and over, absolutely is abuse.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
That's a shitty world that we shouldn't have to live in.
> People who ignore polite requests are assholes, and we are well within our rights to complain about them.
If you are building a new search engine and the robots.txt only includes Google, are you an asshole for indexing the information?
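Concretely, the kind of robots.txt in question would look something like this (a hypothetical sketch, not taken from any real site):

    # Googlebot may fetch everything (an empty Disallow means no restriction)
    User-agent: Googlebot
    Disallow:

    # Every other crawler is asked to stay out entirely
    User-agent: *
    Disallow: /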
Yes, because the site owner has clearly and explicitly requested that you don't scrape their site, fully accepting the consequence that their site will not appear in any search engine other than Google.
Whatever impact your new search engine or LLM might have in the world is irrelevant to their wishes.
Whenever one forms a sentence like that, it is worthwhile to make sure it is a sentence one believes to be generally true.
If someone politely requests you to suck their genitalia, and you ignore that request, does that make you an asshole?
"Theft" may be wrong, but "abuse" certainly is not. Human interactions in general, and the web in particular, are built on certain set of conventions and common behaviors. One of them is that most sites are for consuming information at human paces and volumes, not downloading their content wholesale. There are specialized sites that are fine with that, but they say it upfront. Average, especially hobbyist site, is not that. People who do not abide by it are certainly abusing it.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Yes, and if the rule of not dumping a ton of manure on your driveway is so important to you, you should live in a gated community and hire round-the-clock security. Some people do, but living in a society where the only way to not wake up with a ton of manure in your driveway is to spend excessive resources on security is not the world that I would prefer to live in. And I don't see why people spend time arguing that this is the only possible, normal world - it certainly isn't; we can do better.
Theft is correct but for a different reason.
The #1 reason for all AI scrapers is to replace the content they are scraping. This means no "fair use" defense to the copyright infringement they inevitably commit.
If you ignore a polite request, then it is perfectly OK to give you as much false data as possible. You have shown yourself not interested in good-faith cooperation, which means other people can and should treat you as a jerk.
> robots.txt is a polite request to please not scrape these pages
At the same time, an HTTP GET request is a polite request to respond with the expected content. There is no binding agreement that my webserver sends you the webpage you asked for. I am at liberty to enforce my no-scraping rules however I see fit. I get to choose whether I'm prepared to accept the consequences of a "real user" tripping my web scraping detection thresholds and getting firewalled or served nonsense or zipbombed (or whatever countermeasure I choose). Perhaps that'll drive away a reader (or customer) who opens 50 tabs to my site all at once; perhaps Google will send a badly behaved bot, miss indexing some of my pages, or even deindex my site. For my personal site I'm 100% OK with those consequences. For work's website I still use countermeasures but set the thresholds significantly more conservatively. For production webapps I use different but still strict thresholds and different countermeasures.
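For anyone wondering what such a threshold looks like, here is a minimal token-bucket sketch in Python (illustrative numbers only; a real setup would usually live in nginx, a firewall, or fail2ban rather than in application code):

    import time
    from collections import defaultdict

    RATE = 5    # sustained requests per second allowed per IP (made-up value)
    BURST = 20  # short bursts tolerated before the threshold trips

    # ip -> (tokens remaining, time of last request)
    buckets = defaultdict(lambda: (float(BURST), time.monotonic()))

    def allow(ip: str) -> bool:
        tokens, last = buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1:
            buckets[ip] = (tokens, now)
            return False  # over threshold: firewall, serve nonsense, zipbomb...
        buckets[ip] = (tokens - 1, now)
        return True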
Anybody who doesn't consider typical AI company's webscraping behaviour over the last few years to qualify as "abuse" has probably never been responsible for a website with any volume of vaguely interesting text or any reasonable number of backlinks from popular/respected sites.
It may be naivete, but I love the standards-based open web as a software platform and as a fabric that connects people. It makes my blood boil that some solipsistic, predatory bastards are eager to turn the internet into a dark forest.
Seriously. Did you see what that web server was wearing? I mean, sure it said "don't touch me" and started screaming for help and blocked 99.9% of our IP space, but we got more and they didn't block that so clearly they weren't serious. They were asking for it. It's their fault. They're not really victims.
Sexual consent is sacred. This metaphor is in truly bad taste.
When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
> When you return a response with a 200-series status code, you've granted consent. If you don't want to grant consent, change the logic of the server.
"If you don't consent to me entering your house, change its logic so that picking the door's lock doesn't let me open the door"
Yeah, well…
As if the LLM scrapers didn't try everything under the sun, like using millions of different residential IPs, to prevent admins from "changing the logic of the server" so it doesn't "return a response with a 200-series status code" when they don't agree to this scraping.
As if there weren't broken assumptions that make "When you return a response with a 200-series status code, you've granted consent" very false.
As if technical details were good carriers of human intents.
The locked door is a ridiculous analogy when it comes to the open web. Pretty much all "door" analogies are flawed, but sure let's imagine your web server has a door. If you want to actually lock the door, you're more than welcome to put an authentication gate around your content. A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way.
Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point hopefully efficiently, not to be mathematically true. They are not to everyone's taste, me included in most cases. They are mostly fine as long as they are not used to make a point, but only to illustrate it.
I agree with this criticism of this analogy, I actually had this flaw in mind from the start. There are other flaws I have in mind as well.
I have developed the point further, without the analogy, in the remainder of the comment. How about we focus on the crux of the matter?
> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way
The point is that these scrapers use tricks so that it's difficult not to grant them access. What is unreasonable here is to think that 200 means consent, especially knowing about the tricks.
Edit:
> you're more than welcome to put an authentication gate around your content.
I don't want to. Adding auth so LLM providers don't abuse my servers and the work I meant to share publicly is not a working solution.
People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.
As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
> They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET.
Legally in the US a “public” web server can have any set of usage restrictions it feels like even without a login screen. Private property doesn’t automatically give permission to do anything even if there happens to be a driveway from the public road into the middle of it.
The law cares about authorized access, not the specific technical implementation of access. Which has caused serious legal trouble for many people when they make seemingly reasonable assumptions, say that access to someURL/A12.jpg also gives them permission to access someURL/A13.jpg, etc.
...but the matter of "what the law cares about" is not really the point of contention here - what matters here is what happens in the real world.
In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.
> In the real world, these requests are being made, and servers are generating responses.
Except that’s not the end of the story.
If you’re running a scraper and risking serious legal consequences when you piss off someone running a server enough, then it suddenly matters a great deal independent of what was going on up to that point. Having already made these requests you’ve just lost control of the situation.
That’s the real world we’re all living in, you can hope the guy running a server is going to play ball but that’s simply not under your control. Which is the real reason large established companies care about robots.txt etc.
> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.
I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.
But I'd like to re-state the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.
I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.
> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.
It is not about denying access to the content to some and allowing access to others.
It is about having to deal with abuses.
Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.
The CFAA wants to have a word. The fact that a server responds with a 200 OK has no bearing on the legality of your request, there's plenty of precedent by now.
Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.
How about AI companies just act ethically and obey norms?
Here's my analogy: it's like you own a museum and you require entrance by "secret" password (your user-agent filtering or what not). The problem is that the password is the same for everyone, so would you be surprised when someone figures it out or gets it from a friend and they visit your museum? Either require a fee (processing power, captcha etc) or make a private password (auth)
It is inherently a cat and mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many resources on your server, or require auth / captcha / javascript / whatever whenever the client is using too many resources. If the client still chooses to go through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
> Either require a fee (processing power, captcha etc) or make a private password (auth)
Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.
> It is inherently a cat and mouse game that you CHOOSE to play
No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.
That's like saying you shouldn't have to sanitize your database inputs because you never asked for people to SQL inject your database. This stance is truly mind boggling to me
Would you take the defense of attackers using SQL injections? Because it feels like people here, including you, are defending the LLM scrapers against sysadmins and authors who dare share their work publicly.
Ensuring basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these LLM companies carry out.
But that's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.
It’s both. You should sanitize your inputs because there are bad actors, but you also categorize attempts to sql inject as abuse and there is legal recourse.
Perhaps bad taste, but bots could also be quite deliberately violating the most private or traumatizing moments a vulnerable person has, in any exploitative way they care to. I am not sure bad taste is enough of an excuse not to discuss the issue, as many people do in fact use the internet for sexual things. If anything, consent should be MORE important because it is easier to document and verify.
A vast hoard of personal information exists and most of it never had or will have proper consent, knowledge, or protection.
> the most private or traumatizing moments a vulnerable person has
...and in this hypothetical, this person is serving them via an unauthenticated http server and hoping that clients will respect robots.txt?
Robots are supposed to behave. It was a solved problem 30 years ago until AI bros unsolved it. Any entity that does not obey robots.txt is by definition a malicious actor.
Future rapist right here.
[flagged]
They also conveniently missed the point that it was about victim blaming.
If you absolutely want a sexual metaphor, it's more like you snuck into the world-record attempt for how many sexual partners a woman can take in 24 hours, and even though you aren't on the list you still got to smash.
The solution is the same: implement better security.
Thank you for finding the right metaphor. If there is a sign out front that has a list of individuals that should go away but they continue, they're in a lot of legal trouble. If they show a fake ID to the event organizers that are handling all the paperwork, that is also something that will land them in prison.
How else do you tell the bot you do not wish to be scraped? Your analogy is lacking - you didn’t order a package, you never wanted a package, and the postman is taking something, not leaving it, and you’ve explicitly left a sign saying ‘you are not welcome here’.
If you are serving web pages, you are soliciting GET requests, kind of like ordering a package is soliciting a delivery.
"Taking" versus "giving" is neither here nor there for this discussion. The question is are you expressing a preference on etiquette versus a hard rule that must be followed. I personally believe robots.txt is the former, and I say that as someone who serves more pages than they scrape
Having a front door physically allows anyone on the street to come to knock on it. Having a "no soliciting" sign is an instruction clarifying that not everybody is welcome. Having a web site should operate in a similar fashion. The robots.txt is the equivalent of such a sign.
No soliciting signs are polite requests that no one has to follow, and door to door salesman regularly walk right past them.
No one is calling for the criminalization of door-to-door sales and no one is worried about how much door-to-door sales increases water consumption.
If a company was sending hundreds of salesmen to knock at a door one after the other, I'm pretty sure they could successfully get sued for harassment.
Can’t Americans literally shoot each other for trespassing?
Generally, legally, no, not just for ignoring a “no soliciting” sign.
But they’re presumably trespassing.
And, despite what ideas you may get from the media, mere trespass without imminent threat to life is not a justification for deadly force.
There are some states where the considerations for self-defense do not include a duty to retreat if possible, either in general ("stand your ground" laws) or specifically in the home ("castle doctrine"), but all the other requirements for self-defense (imminent threat of certain kinds of serious harm, proportional force) remain part of the law in those states, and trespassing by/while disregarding a "no soliciting" sign would not, by itself, satisfy those requirements.
> door to door salesman regularly walk right past them.
Oh, now I understand why Americans can't see a problem here.
>No one is calling for the criminalization of door-to-door sales
Ok, I am, right now.
It seems like there are two sides here that are talking past one another: "people will do X and you accept it if you do not actively prevent it, if you can" and "X is bad behavior that should be stopped and shouldn't be the burden of individuals to stop". As someone who leans to the latter, the former just sounds like restating the problem being complained about.
> No one is calling for the criminalization of door-to-door sales
Door-to-door sales absolutely are banned in many jurisdictions.
And a no soliciting sign is no more cosmically binding than robots.txt. It's a request, not an enforceable command.
Tell me you work in an ethically bankrupt industry without telling me you work in an ethically bankrupt industry.
Well yes this is exactly what's happening as of now. But there SHOULD be a way to upload content without giving it access to scrapers.
I disagree strongly here - though not from a technical perspective. There's absolutely a legal concept of making your work available for viewing without making it available for copying, and AI scraping (while we can technically phrase it as just viewing a bunch of times) is effectively copying.
Let's say a large art hosting site realizes how damaging AI training on their data can be - should they respond by adding a paywall before any of their data is visible? If that paywall is added (let's just say $5/mo), can most of the artists currently on their site afford to stay there? Can they afford it if their potential future patrons are limited to just those folks who can pay $5/mo? Would the scraper be able to afford a one-time cost of $5 to scrape all of that data?
I think, as much as they are a deeply flawed concept, this is a case where EULAs, or an assumption of no access for training unless explicitly granted, actually enforced through the legal system, are required. There are a lot of small businesses and side projects that are dying because of these models, and I think that creative outlet has societal value we would benefit from preserving.
> There's absolutely a legal concept of making your work available for viewing without making it available for copying
This "legal concept" is enforceable through legacy systems of police and violence. The internet does not recognize it. How much more obvious can this get?
If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
> legacy systems of police and violence
You use "legacy" as if these systems are obsolete and on their way out. They're not. They're here to stay, and will remain dominant, for better or worse. Calling them "legacy" feels a bit childish, as if you're trying to ignore reality and base arguments on your preferred vision of how things should be.
> The internet does not recognize it.
Sure it does. Not universally, but there are a lot of things governments and law enforcement can do to control what people see and do on the internet.
> If we stumble down the path of attempting to apply this legal framework, won't some jurisdiction arise with no IP protections whatsoever and just come to completely dominate the entire economy of the internet?
No, of course not, that's silly. That only really works on the margins. Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.
> If I can spin up a server in copyleftistan with a complete copy of every album and film ever made, available for free download, why would users in copyrightistan use the locked down services of their domestic economy?
Because the governments of all the copyrightistans will block all traffic going in and out of copyleftistan. While this may not stop determined, technically-adept people, it will work for the most part. As I said, this sort of thing only really works on the margins.
I guess I'm more optimistic about the future of the human condition.
> You use "legacy" as if these systems are obsolete and on their way out. They're not.
I have serious doubts that nation states will still exist in 500 years. I feel quite certain that they'll be gone in 10,000. And I think it's generally good to build an internet for those time scales.
> base arguments on your preferred vision of how things should be.
I hope we all build toward our moral compass; I don't mean for arguments to fall into fallacies on this basis, but yeah, I think our internet needs to be resilient against the waxing and waning of the affairs of state. I don't know if that's childish... Maybe we need to have a more child-like view of things? The internet _is_ a child in the sense of its maturation timeframe.
> there are a lot of things governments and law enforcement can do to control what people see and do on the internet.
Of course there are things that governments do. But are they effective? I just returned from a throatsinging retreat in Tuva - a fairly remote part of Siberia. The Russian government has apparently quietly begun to censor quite a few resources on the internet, and it has caused difficulty in accessing the traditional music of the Tuvan people. And I was very happily astonished to find that everybody I ran into, including a shaman grandmother, was fairly adept at routing around this censorship using a VPN and/or SSH tunnel.
I think the internet is doing a wonderful job at routing around censorship - better than any innovation ever discovered by humans so far.
> Any other country would immediately slap economic sanctions on that free-for-all jurisdiction and cripple them. If that fails, there's always a military response they can resort to.
Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.
And at the end of the day, I don't think governments even have the power here - the content creators do. I distribute my music via free channels because that's the easiest way to reach my audience, and because, given the high availability of compelling free content, there's just no way I can make enough money on publishing to even concern myself with silly restrictions.
It seems to me that I'm ahead of the curve in this area, not behind it. But I'm certainly open to being convinced otherwise.
> Again, maybe I'm just more optimistic, but I think that on longer time frames, the sober elder statesmen/women will prevail and realize that violence is not an appropriate response to bytes transiting the wire that they wish weren't.
Your framing is off because this notion of fairness or morality isn't something they concern themselves with. They're using violence because if they didn't, they would be allowing other entities to gain wealth and power at their expense. I don't think it's much more complex than that.
See how differently these same bytes are treated in the hands of Aaron Swartz vs OpenAI. One threatened to empower humanity at the expense of reducing profits for a few rich men, so he got crucified for it. The other is hoping to make humans redundant, concentrate the distribution of wealth even further, and strengthen the US world dominance, so all of the right wheels get greased for them and they get a license to kill - figuratively and literally.
I mean... I agree with everything you've said here. I'm not sure what makes you think I've mis-framed the stakes.
If I order a package from a company selling a good, am I inviting all that company's competitors to show up at my doorstep to try and outbid the delivery person from the original company when they arrive, and maybe they all show up at the same time and cause my porch to collapse? No, because my front porch is a limited resource for which I paid for an intended purpose. Is it illegal for those other people to show up? Maybe not by the letter of the law.
> If you are serving web pages, you are soliciting GET requests
So what's the solution? How do I host a website that welcomes human visitors, but rejects all scrapers?
There is no mechanism! The best I can do is a cat-and-mouse arms race where I try to detect the traffic I don't want, and block it, while the people generating the traffic keep getting more sophisticated about hiding from my detection.
No, putting up a paywall is not a reasonable response to this.
> The question is whether you are expressing a preference about etiquette or a hard rule that must be followed.
Well, there really aren't any hard rules that must be followed, because there are no enforcement mechanisms outside of going nuclear (requiring login). Everything is etiquette. And I agree that robots.txt is also etiquette, and it is super messed up that we tolerate "AI" companies stomping all over that etiquette.
Do we maybe want laws that say everyone must respect robots.txt? Maybe? But then people will just move their scrapers to a jurisdiction without those laws. And I'm sure someone could make the argument that robots.txt doesn't apply to them because they spoofed a browser user-agent (or another user-agent that a site explicitly allows). So perhaps we have a new mechanism, or new laws, or new... something.
But this all just highlights the point I'm making here: there is no reasonable mechanism (no, login pages and http auth don't count) for site owners to restrict access to their site based on these sorts of criteria. And that's a problem.
Ignoring a rate limit gets you blocked.
Scrapers actively bypass this by rotating IP addresses.
You require something the bot won't have that a human would.
Anybody may watch the demo screen of an arcade game for free, but you have to insert a quarter to play — and you can have even greater access with a key.
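The web equivalent of that quarter is a proof-of-work interstitial (tools like Anubis take roughly this approach). A toy sketch of the idea, purely illustrative:

    import hashlib
    import secrets

    DIFFICULTY = 20  # leading zero bits required; tune to taste

    def make_challenge() -> str:
        # The server hands this nonce to the client along with the difficulty.
        return secrets.token_hex(16)

    def verify(challenge: str, answer: str) -> bool:
        digest = hashlib.sha256((challenge + answer).encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: str) -> str:
        # What the client-side JavaScript grinds through before it gets the page:
        # cheap for one human page view, expensive at scraper volume.
        i = 0
        while not verify(challenge, str(i)):
            i += 1
        return str(i)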
> and you’ve explicitly left a sign saying ‘you are not welcome here’
And the sign said "Long-haired freaky people
Need not apply"
So I tucked my hair up under my hat
And I went in to ask him why
He said, "You look like a fine upstandin' young man
I think you'll do"
So I took off my hat and said, "Imagine that
Huh, me workin' for you"
> You require something the bot won't have that a human would.
Is this why the “open web” is showing me a captcha or two, along with their cookie banner and newsletter pop up these days?
Up until people started making a big stink about CAPTCHAs being used for unpaid labor at scale, uh, well they had two purposes.
Stop your http server if you do not wish to receive http requests.
Turn off your phone if you don't want to receive robo-dialed calls and unsolicited texts 300 times a day.
Fence off your yard if you don't want people coming by and dumping a mountain of garbage on it every day.
You can certainly choose to live in a society that thinks these are acceptable solutions. I think it's bullshit, and we'd all be better off if anyone doing these things would be breaking rocks with their teeth in a re-education camp, until they learn how to be a decent human being.
Ah yes, and unplug the mail server to stop all spam. Great idea!
It's simple, and I'll quote myself - "robots.txt isn't the law".
"I will do anything that will not literally get me into jail" is a pretty low bar. Most decent people try to do better than that - and that's the only reason society still exists, because there's not enough cops to put all bad people into jail and never will be.
Quoting Cervisia:
> robots.txt. This is not the law
In Germany, it is the law. § 44b UrhG says (translated):
(1) Text and data mining is the automated analysis of one or more digital or digitized works to obtain information, in particular about patterns, trends, and correlations.
(2) Reproductions of lawfully accessible works for text and data mining are permitted. These reproductions must be deleted when they are no longer needed for text and data mining.
(3) Uses pursuant to paragraph 2, sentence 1, are only permitted if the rights holder has not reserved these rights. A reservation of rights for works accessible online is only effective if it is in machine-readable form.
-- https://news.ycombinator.com/item?id=45776825
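Whether any existing mechanism satisfies that "machine-readable" requirement is still being argued over, but robots.txt is the most commonly cited candidate. A sketch of what such a reservation might look like (GPTBot and CCBot are crawler names those operators publish; treating this as a valid § 44b reservation is an assumption, not settled law):

    # Reservation of text-and-data-mining rights, expressed per crawler:
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /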
Violating norms makes you an abusive jerk at best.
Put your content behind authentication if you don’t want it to be requested by just anyone.
But I do want my content accessible to "just anyone", as long as they are humans. I don't want it accessible to bots.
You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!
> as long as they are humans. I don't want it accessible to bots.
A curious position. There isn't a secondary species using the internet. There are only humans. Unless you foresee some kind of alien invasion or earthworm uprising, nothing other than humans will ever access your content. Rejecting the tools humans use to bridge their biological gaps is rather nonsensical.
> You are free to say "well, there is no mechanism to do that", and I would agree with you. That's the problem!
I suppose it would be pretty neat if humans were born with some kind of internet-like telepathy ability, but lacking that mechanism isn't any kind of real problem. Humans are well adept at using tools and have successfully used tools for millennia. The internet itself is a tool! Which, like before, makes rejecting the human use of tools nonsensical.
Even abusive crawlers and scrapers are acting as agents of real humans, just as your browser is acting as your agent. I don't even know how you could reliably draw a reasonable line in the sand between the two without putting some group of people on the wrong side of the line.
I suppose the ultimate solution would be browsers and operating systems and hardware manufacturers co-operating to implement some system that somehow cryptographically signs HTTP requests which attests that it was triggered by an actual, physical interaction with a computing device by a human.
Though you don't have to think for very long to come up with all kinds of collateral damage that would cause and how bad actors could circumvent it anyway.
All in all, this whole issue seems more like a legal problem than a technical one.
Or the AI people could just stop being abusive jerks. That's an even easier solution.
While that is probably good advice in general, the earlier commenter wanted even the abusive jerks to have access to his content.
He just doesn't want tools humans use to access content to be used in association with his content.
What he failed to realize is that if you eliminate the tools, the human cannot access the content anyway. They don't have the proper biological interfaces. Had he realized that, he'd have come to notice that simply turning off his server fully satisfies the constraints.
That would be easier. Too bad it won't ever happen.
What the hell? That is incredibly discriminatory. Fuck off. I support those that counter those discriminatory mechanisms.
Discriminatory against bots? That doesn't even make any sense.
They probably have stock options.
> I agree. It always surprises me when people are indignant about scrapers ignoring robots.txt and throw around words like "theft" and "abuse."
This feels like the kind of argument some would make as to why they aren't required to return their shopping cart to the bay.
> robots.txt is a polite request to please not scrape these pages because it's probably not going to be productive. It was never meant to be a binding agreement, otherwise there would be a stricter protocol around it.
Well, no. That's an overly simplistic description which fits your argument, but doesn't accurately represent reality. Yes, robots.txt was created as a hint for robots, but it was never expected to be treated as optional. The important detail, the one that explains why it's called robots.txt, is that the web server exists to serve the requests of humans. Robots are welcome too, but please follow these rules.
You can tell your description is completely inaccurate and unrepresentative of the expectations of the web as a whole, because every popular LLM scraper goes out of its way to both follow robots.txt and announce that it does so.
> It's kind of like leaving a note for the deliveryman saying please don't leave packages on the porch.
It's nothing like that, it's more like a note that says no soliciting, or please knock quietly because the baby is sleeping.
> It's fine for low stakes situations, but if package security is of utmost importance to you, you should arrange to get it certified or to pick it up at the delivery center.
Or, people could not be assholes? Yes, I get it, the reality we live in there are assholes. But the problem as I see it, is not just the assholes, but the people who act as apologists for this clearly deviant behavior.
> Likewise if enforcing a rule of no scraping is of utmost importance you need to require an API token or some other form of authentication before you serve the pages.
Because it's your fault if you don't, right? That's victim blaming. I want to be able to host free, easy to access content for humans, but someone with more money, and more compute resources than I have, gets to overwhelm my server because they don't care... And that's my fault, right?
I guess that's a take...
There's a huge difference between suggesting mitigations for dealing with someone abusing resources, and excusing the abuse of resources, or implying that I should expect my server to be abused, instead of frustrated about the abuse.
The metaphor doesn’t work. It’s not the security of the package that’s in question, but something like whether the delivery person is getting paid enough or whether you’re supporting them getting replaced by a robot. The issue is in the context, not the protocol.
There's an evolving morality around the internet that is very, very different from the pseudo-libertarian rule of the jungle I was raised with. Interesting to see things change.
The evolutionary force is really just "everyone else showed up at the party". The Internet has gone from a capital-I thing that was hard to access, to a little-i internet that was easier to access and well known but still largely distinct from the real world, to now... just the real world in virtual form. Internet morality mirrors real world morality.
For the most part, everybody is participating now, and that brings all of the challenges of any other space with everyone's competing interests colliding - but fewer established systems of governance.
Based on the comments here, the polite world of the internet where people obeyed unwritten best practices is certainly over, in favour of "grab what you can, might makes right".
That was never the internet. The old internet was "information wants to be free, good luck if you want to restrict my access or resharing".
You're very much wrong. Two of the key tenets of libertarianism are that your rights end where my nose begins, and respect for property rights. If your AI bot is causing problems for me, then you should be compensating me for the damage or other expense you caused. But the AI bros think they should be able to take anything they want whenever they want without compensation, and they'll use every single shady behavior they can to make that happen. In other words, they're robber barons.