Hacker News

Any analogy is flawed and you can kill most analogies very fast. They are meant to illustrate a point hopefully efficiently, not to be mathematically true. They are not to everyone's taste, me included in most cases. They are mostly fine as long as they are not used to make a point, but only to illustrate it.

I agree with this criticism of the analogy; I actually had this flaw in mind from the start. There are other flaws I have in mind as well.

I developed the point further, without the analogy, in the rest of the comment. How about we focus on the crux of the matter?

> A web server that accepts a GET request and replies 2xx is distinctly NOT "locked" in any way

The point is that these scrapers use tricks that make it difficult not to grant them access. What is unreasonable here is to think that a 200 means consent, especially knowing about the tricks.

Edit:

> you're more than welcome to put an authentication gate around your content.

I don't want to. Adding auth so that LLM providers don't abuse my servers and the work I meant to share publicly is not a workable solution.

People need to have a better mental model of what it means to host a public web site, and what they are actually doing when they run the web server and point it at a directory of files. They're not just serving those files to customers. They're not just serving them to members. They're not just serving them to human beings. They're not even necessarily serving files to web browsers. They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET. There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

As the web server operator, you can try to figure out if there's a human behind the IP, and you might be right or wrong. You can try to figure out if it's a web browser, or if it's someone typing in curl from a command line, or if it's a massively parallel automated system, and you might be right or wrong. You can try to guess what country the IP is in, and you might be right or wrong. But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

ryandrake 2 days ago

Retric 2 days ago

> They're serving files to every IP address (no matter what machine is attached to it) that is capable of opening a socket and sending GET.

Legally in the US a “public” web server can have any set of usage restrictions it feels like even without a login screen. Private property doesn’t automatically give permission to do anything even if there happens to be a driveway from the public road into the middle of it.

The law cares about authorized access, not the specific technical implementation of access. That has caused serious legal trouble for many people who made seemingly reasonable assumptions, for example that access to someURL/A12.jpg also gives them permission to access someURL/A13.jpg.

jMyles 2 days ago

...but the matter of "what the law cares about" is not really the point of contention here - what matters here is what happens in the real world.

In the real world, these requests are being made, and servers are generating responses. So the way to change that is to change the logic of the servers.

> In the real world, these requests are being made, and servers are generating responses.

Except that’s not the end of the story.

If you’re running a scraper and risking serious legal consequences when you piss off someone running a server enough, then it suddenly matters a great deal independent of what was going on up to that point. Having already made these requests you’ve just lost control of the situation.

That’s the real world we’re all living in. You can hope the guy running a server is going to play ball, but that’s simply not under your control. That is the real reason large established companies care about robots.txt etc.
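For what it's worth, honoring robots.txt is cheap on the scraper side. A minimal sketch using Python's standard `urllib.robotparser`; the bot name `ExampleBot`, the rules, and the example.com URLs are made up for illustration, and a real crawler would fetch the file from the site instead of hardcoding it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hardcoded here for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "https://example.com/private/A13.jpg"))
    print(may_fetch("ExampleBot", "https://example.com/index.html"))
```

Note that this only expresses the operator's wishes; nothing in the protocol forces a client to run the check.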

jraph 2 days ago

> There's no such distinct thing as a scraper--and if your mental model tries to distinguish between a scraper and a human user, you're going to be disappointed.

I disagree. If your mental model doesn't allow conceptualizing (abusive) scrapers, it is too simplistic to be useful for understanding and dealing with reality.

But I'd like to restate the frame / the concern: it's not about any bot or any scraper, it is about the despicable behavior of LLM providers and their awful scrapers.

I'm personally fine with bots accessing my web servers, there are many legitimate use cases for this.

> But if you really want to actually limit access to the content, you shouldn't be publishing that content publicly.

It is not about denying access to the content to some and allowing access to others.

It is about having to deal with abuses.

Is a world in which people stop sharing their work publicly because of these abuses desirable? Hell no.

tremon a day ago

The CFAA wants to have a word. The fact that a server responds with a 200 OK has no bearing on the legality of your request; there's plenty of precedent by now.

oytis 2 days ago

Technically, you are not serving anything - it's just voltage levels going up and down with no meaning at all.

bigbuppo 2 days ago

How about AI companies just act ethically and obey norms?

jack_pp 2 days ago

Here's my analogy: it's like you own a museum and you require entrance by "secret" password (your user agent filtering or whatnot). The problem is that the password is the same for everyone, so would you be surprised when someone figures it out or gets it from a friend and visits your museum? Either require a fee (processing power, captcha etc.) or make a private password (auth).

It is inherently a cat and mouse game that you CHOOSE to play. Either implement throttling for clients that consume too many of your server's resources, or require auth / captcha / javascript / whatever whenever a client is using too many resources. If the client still chooses to jump through the hoops you implemented, then I don't see any issue. If you still have an issue, then implement more hoops until you're satisfied.
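To make "throttling" concrete, here is a rough sketch of a per-client token bucket. The class name `Throttle`, its parameters, and the choice of client key (an IP address, an API key, whatever you can identify) are all assumptions for illustration, not something any particular web server ships:

```python
import time


class Throttle:
    """Per-client token bucket: each client may burst up to `capacity`
    requests, then is held to `refill_per_sec` requests per second."""

    def __init__(self, capacity=5, refill_per_sec=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock          # injectable so the logic can be tested
        self.buckets = {}           # client key -> (tokens, last_seen)

    def allow(self, client):
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.capacity, now))
        # Refill in proportion to the time elapsed, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return True
        self.buckets[client] = (tokens, now)
        return False  # the caller would answer 429, show a captcha, etc.


if __name__ == "__main__":
    throttle = Throttle(capacity=2, refill_per_sec=1.0)
    print([throttle.allow("203.0.113.7") for _ in range(3)])
```

The obvious weakness is the client key: a scraper that spreads its requests over many addresses gets a fresh bucket for each one.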

> Either require a fee (processing power, captcha etc) or make a private password (auth)

Well, I shouldn't have to work or make things worse for everybody because the LLM bros decided to screw us.

> It is inherently a cat and mouse game that you CHOOSE to play

No, let's not reverse the roles and blame the victims here. We sysadmins and authors are willing to share our work publicly to the world but never asked for it to be abused.

That's like saying you shouldn't have to sanitize your database inputs because you never asked for people to SQL inject your database. This stance is truly mind-boggling to me.

Would you take the defense of attackers using SQL injections? Because it feels like people here, including you, are defending the LLM scrapers against sysadmins and authors who dare share their work publicly.

Ensuring basic security and robustness of a piece of software is simply not remotely comparable to countering the abuse these LLM companies carry on.

But that's not even the point. And preventing SQL injections (through healthy programming practices) doesn't make things worse for any legitimate user either.
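As an illustration of what I mean by healthy programming practices, here is a small sketch with Python's built-in `sqlite3` and a parameterized query. The `posts` table, the `posts_by` helper, and the sample data are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT)")
conn.executemany("INSERT INTO posts (author) VALUES (?)", [("alice",), ("bob",)])


def posts_by(conn, author):
    # The ? placeholder makes the driver treat `author` strictly as data,
    # so it can never change the structure of the query.
    query = "SELECT id, author FROM posts WHERE author = ?"
    return conn.execute(query, (author,)).fetchall()


if __name__ == "__main__":
    print(posts_by(conn, "alice"))
    # A classic injection string is just compared as a literal author name.
    print(posts_by(conn, "alice' OR '1'='1"))
```

A legitimate user querying for "alice" sees no difference at all, which is the contrast with captchas and auth walls.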

catlifeonmars 2 days ago

It’s both. You should sanitize your inputs because there are bad actors, but you also categorize SQL injection attempts as abuse, and there is legal recourse.