> If you have a public website, they are already stealing your work.
I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!
The problem I have is that they hammer my site so hard they take it down.
The content is for everyone. They can have it. Just don't also take it away from everybody else.
Unintentional denial-of-service attacks from AI scrapers are definitely a problem, I just don't know if "theft" is the right way to classify them. They shouldn't get lumped in with intellectual property concerns, which are a different matter. AI scrapers are a tragedy of the commons problem kind of like Kessler syndrome: a few bad actors can ruin low Earth orbit for everyone via space pollution, which is definitely a problem, but saying that they "stole" LEO from humanity doesn't feel like the right terminology. Maybe the problem with AI scrapers could be better described as "bandwidth pollution" or "network overfishing" or something.
Theft isn't far off; it seems closer to me than using the word for IP violations.
When a crawler aggressively crawls your site, they're permanently depriving you of the use of those resources for their intended purpose. Arguably, it looks a lot like conversion.
> Arguably, it looks a lot like conversion.
is this why media networks are buying social ai apps
Yes I completely agree.
you're totally right about not being theft, but we have a term. you used it yourself, "distributed denial of service". that's all it is. these crawlers should be kicked off the internet for abuse. people should contact the isp of origin.
Firstly, since this argument is about semantic pedantry anyways, it's just denial-of-service, not distributed denial-of-service. AI scraper requests come from centralized servers, not a botnet.
Secondly, denial-of-service implies intentionality and malice that I don't think is present from AI scrapers. They cause huge problems, but only as a negligent byproduct of other goals. I think that the tragedy of the commons framing is more accurate.
EDIT: my first point was arguably incorrect because some scrapers do use decentralized infrastructure and my second point was clearly incorrect because "denial-of-service" describes the effect, not the intention. I retract both points and apologize.
ah, no fun, I was going to continue the semantic deconstruction with a whole bunch of technicalities about how you're not quite precisely accurate and you gotta go do the right thing and retract your statements.
boo. took all the fun out of it ;)
Sufficiently advanced negligence is indistinguishable from malice. There is a point you no longer gain anything from treating them differently.
The first is incorrect; these scrapers are usually distributed across many IPs, in my experience. I usually refer to them as "distributed, non-identifying crawlers (DNCs)" when I want to be maximally explicit. (The worst I've seen is some crawler/botnet making exactly one request per IP -_-)
I think the second is incorrect too. DDoS is a DDoS no matter what the intent is.
I think one could argue that one. Is a DDoS a symptom? In which case the intent is irrelevant. Or is a DDoS an attack/crime? In which case it is. We kind of use it to mean both. But I think it's generally the latter. Wikipedia describes it as a "cyberattack", so actually I think intent is relevant to our (society's) current definition.
The semantics that make sense to me is that "DDoS" describes the symptom/effect irrespective of intent, and "DDoS attack" describes the malicious crime. But the terms are frequently used interchangeably.
Been there recently. Rate limit on nginx and anti-syn flood on pf solved it.
I'm being hit with 300 req/s 24/7 from hundreds of thousands of unique IP's from residential proxies. I can't rate limit any further without hurting the real users.
Yeah, IP-based rate limits are nearly ineffective these days.
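To make that concrete, here is a minimal sketch of a per-IP token bucket, assuming a simple in-memory limiter. The class name, the `simulate` helper, the defaults, and the addresses are all hypothetical and are not anyone's actual nginx or pf setup. It only illustrates why a limit keyed on the client address stops one noisy client but not the same load spread over many addresses.

```python
import time


class PerIPTokenBucket:
    """Hypothetical per-IP token bucket; names and defaults are illustrative."""

    def __init__(self, rate_per_sec=1.0, burst=5, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        # ip -> (tokens remaining, timestamp of last update)
        self.state = {}

    def allow(self, ip):
        now = self.clock()
        tokens, last = self.state.get(ip, (float(self.burst), now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[ip] = (tokens - 1.0, now)
            return True
        self.state[ip] = (tokens, now)
        return False


def simulate():
    # Frozen clock so no tokens are refilled and the result is deterministic.
    limiter = PerIPTokenBucket(rate_per_sec=1.0, burst=5, clock=lambda: 0.0)
    # One address sending 300 requests at once: only the burst gets through.
    single = sum(limiter.allow("203.0.113.7") for _ in range(300))
    # 300 distinct addresses sending one request each: every request passes.
    spread = sum(
        limiter.allow(f"10.0.{i // 256}.{i % 256}") for i in range(300)
    )
    return single, spread
```

With the clock frozen, `simulate()` returns `(5, 300)`: the single address is cut off after its burst of 5, while the 300 one-request addresses all get through. Tightening the per-IP numbers further mostly hurts real users, which matches the experience described above.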
I agree theft isn't a good analogy, but there is something similar going on. I put my words out into the world as a form of sharing. I enjoy reading things others write and share freely, so I write so others might enjoy the things I write. But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet. They are using my work in a way I don't want it to be used. It makes me not want to share anymore.
>but there is something similar going on [...]
No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"
This will slightly overlap with the other replies, but to be concise:
> If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing
Yes. The entire point of Copyright and the reason it was invented is to ensure people will keep sharing things. Because otherwise people will just stop publishing things, which is a detriment to all. (Including AI companies, who now don't get new training data)
We have collectively decided that we will give authors some power to say "I don't like how my work is being used" to ensure they don't just "stop sharing".
Fair Use is an exception to that, where the public good does outweigh an individual author's objections. But critically, not such that authors stop publishing. Hence the 4th "factor" in US copyright law (which is one of the most expansive on fair use), where the "effect of the use upon the potential market for or value of the copyrighted work" is evaluated. Fair use isn't supposed to obliterate the value of the original work, or people will stop publishing again.
This is what makes AI training's status so contentious. In terms of direct copyright it is a very weak case. It is incredibly hard to prove a direct 1:1 copy from AI training data into the model and into the output; you have to argue about the architecture of LLMs and its inability to separate copyrightable expressions from uncopyrightable facts.
Yet in spirit, AI training clearly violates copyright. The explicit stated purpose is to copy the works for training data, oft without any compensation or even permission, in order to create a machine that will annihilate the market for all works used.
People already are pulling back on the amount of works they share.
> If you put stuff out in public for anyone to use, then find out it's used in a way you don't like
Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court, and for which they already had to pay billions in fines.
Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.
>Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court,
...the same courts that ruled that AI training is probably fair use? Fair use trumps whatever restrictions an author puts in their "licenses". If you're an author and it turned out that your book was pirated by AI companies then fair enough, but "I put my words out into the world as a form of sharing" strongly implied that's not what was happening, e.g. it was a blog on the open internet or something.
I never understand why anyone wants authors to not be able to enforce copyright and licensing laws for AI training. Unless you are Anthropic or OAI it seems like a wild stance to have. It’s good when people are rewarded for works that other people value. If trainers don’t value the work, they shouldn’t train on it. If they do, they should pay for it.
My own view is, I thought we were all agreed that the idea that Microsoft can restrict Wine from even using ideas from Windows, such that people who have read the leaked Windows source cannot contribute to Wine, was a horrible abuse of the legal system that we only went along with under duress? Now when it's our data being used, or more cynically when there's money to be made, suddenly everyone is a copyright maximalist.
No. Reading something, learning from it, then writing something similar, is legal; and more importantly, it is moral. There is no violation here. Copyright holders already have plenty of power; they must not be given the power to restrict the output of your brain forever more for merely having read and learnt. Reading and learning is sacred. Just as importantly, it's the entire damn basis of our profession!
If you do not want people to read and learn from your content, do not put it on the web.
If you want people to read and learn from each other, you should incentivize people to make content worth reading and learning from. Making LLM training a viable loophole for copyright law means there won’t be incentives to produce such work.
I don't think that's the case.
People getting better at writing is only going to increase the quality of the output.
Increasing both competition and tooling (by providing every writer with the world's greatest encyclopedia/thesaurus/line-editor/brainstormer/planner/etc) is only going to make writers better.
Will there be lots of people who misuse the system? Are there lots of people who use thesaurus words without knowing what they're talking about? Can't you tell the difference?
I see in LLMs a lowering of the ground floor making it easier for people to get in. This will increase the total availability of content.
I also see in LLMs a raising of the top bar making it harder to be the best. If more people are writing and more people are trying to be the best, the best is going to get better.
Consider chess. Have we suddenly stopped playing chess now that a phone can beat 95+% of people? No. The market is stronger than ever and still growing. The greatest players in the world use chess algorithms to refine their play, and the play keeps expanding in new and interesting ways.
In both writing and chess, yes, there is an explosion of low and middling play. But since when have we not always had people producing content and playing chess that when compared to the masters of the field is generally viewed as substandard?
But here's the kicker. Some people's favorite genre is badly edited fanfic. Some people genuinely derive actual pleasure from things that you or I might call garbage. And what's wrong with that? Who am I to say that you can't love klutzy firecop loves suburban housewife paperbacks? Or Zelda/Harry Potter crossfics or whatever.
Re-reading your comment, I think we’re both generally anti-corporate-fuckery. I view the current batch of copyright pearl clutching as an argument about whether VCs are allowed to steal books to make their chatbots worth talking to, and the Wine/MSoft debate as being about whether it should be legal to engage in anticompetitive behavior by restrictive use of copyright. In both of these cases the root of the issue isn’t really copyright as an abstraction; it’s the bludgeoning of the person with less money by use of overwhelming legal costs to have a day in court.
>I never understand why anyone wants authors to not be able to enforce copyright and licensing laws for AI training.
Fair use is part of "copyright and licensing laws".
Would using an actor's face and voice as training data be fair use?
What if the model then creates a virtual actor that is very close to the real actor?
>What if the model then creates a virtual actor that is very close to the real actor?
"Likeness" is a separate concept from copyrights
https://en.wikipedia.org/wiki/Personality_rights
I wish I lived in the alternative timeline where open source folks didn't look a gift horse in the mouth and actually used these tools to copyleft the shit out of software to the point where proprietary closed source software has no advantage.
But instead we've got people posting "honey pots" that an LLM will immediately detect and route around.
I bet we'd cure all cancers in a month if everyone whining about slop actually went and did something about it.
It sounds like you wanted to believe you were sharing freely while sharing conditionally.
> But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet.
I don't think that's the case. I'm not even arguing they aren't the worst people on the planet - might as well be. But all I see them doing is burning money all over the place.
They’re getting the money to burn, though
If you want a good analogy, try the enclosure of the commons in the British countryside. Communally managed grasslands were destroyed by noblemen with massive herds of cattle overgrazing the land, kickstarting a land grab that effectively forced people to enclose or be left behind themselves. Property is a virus that destroys all other forms of allocation.
> nothing but thieves!
cool band btw
If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?
Odd thing about cookies… they disappear after one serving.
Websites are an endless stream of cookies.
The analogy doesn’t hold.
If copying content from one hard drive to another is theft, then so is DNA copying itself.
Everything is a Remix culture. We should promote remix culture rather than hamper it.
Everything is a Remix (Original Series) https://youtu.be/nJPERZDfyWc
Fine.
Me and my 9 friends stand around the cookie-serving person blocking everyone else.
It's taking all the cookies over a period of time.
The analogy was good.
how about this analogy: I created a most tasty cookie recipe. I give it out for free, and all copies have my name because I am a vain person who likes to be known far and wide as the best baking chef ever. Is it ok to get the recipe, remove my name, and write in LLM-Codex as the creator? again, i'm ok with giving the recipe for free, i just want my name out there.
>Is it ok to get the recipe, remove my name, and write in LLM-Codex as the creator? again, i'm ok with giving the recipe for free, i just want my name out there.
From a legal perspective, it's a pretty clear "no". The instructions in recipes aren't copyrightable. The moral question is more ambiguous, but it's still pretty weak. Most recipes are uncredited, and it's unclear why someone can force everyone to attribute the recipe to them when all they realistically did was tweak the dish a bit. In the example above, I doubt you invented cookies.
i'm curious, do you honestly think the argument was about recipes and cookies? maybe it was an analogy? looking back up the comment tree, it does seem to be an analogy, not a discussion about ACTUAL cookies and ACTUAL recipes.
>maybe it was an analogy?
In that case it's a terrible analogy because if you can't get people to agree on the cookies case, what hope do you have to extend it to the case you're trying to apply the analogy to? It's like saying "You wouldn't pirate a movie, why would you pirate a blog post", because most people would pirate movies.
oh man.
my comment was about the very human need to be recognized for something created, made, or thought by a person. People are ok with writing blog posts, they're ok with writing software, and they're ok with giving it all away for free, but they want their name attached and their contribution recognized.
>my comment was about the very human need to be recognized for something created, made, or thought by a person.
And I specifically addressed that aspect:
>The moral question is more ambiguous, but it's still pretty weak. Most recipes are uncredited, and it's unclear why someone can force everyone to attribute the recipe to them when all they realistically did was tweak the dish a bit. In the example above, I doubt you invented cookies.
The cookies analogy was terrible because recipes are rarely credited, but even ignoring the terrible analogy the "recognition" argument still fails. If you wrote a blog post on how to set up kubernetes (or whatever), then it's fair enough that you get recognized for that specific blog post. If my friend asked me how to set up kubernetes, it wouldn't be cool for me to copy paste your blog post and send it over.
However similar to copyright, the recognition you deserve quickly drops off once it moves beyond that specific work. If I absorbed the knowledge from your blog post, then wrote another guide on setting up kubernetes, perhaps updated for my use case, it's unreasonable to require that you be credited. It might be nice, and often times people do, but it's also unreasonable if you wrote an angry letter demanding that you be credited. You weren't the inventor of kubernetes, and you probably got your knowledge of kubernetes from elsewhere (eg. the docs the creators made), so why should everyone have to credit you in perpetuity?
your ability to not address my argument's main point is something to behold. can't tell if you're doing it on purpose or not.
if humans read my blog posts and then wrote things without credit that would be fine. i like human eyeballs and i like them on my content. that's exactly the purpose of the blog post (_in this particular example_), to get human eyeballs on the content.
>your ability to not address my argument's main point is something to behold. can't tell if you're doing it on purpose or not.
Or maybe you're just terrible at writing.
>if humans read my blog posts and then wrote things without credit that would be fine.
I'm not sure how I (or anyone) was supposed to come away with this conclusion when you were writing stuff like:
"i'm ok with giving the recipe for free, i just want my name out there"
"the very human need to be recognized for something created"
"they want their name attached and their contribution recognized".
there is nothing contradictory in what i said, and if you weren't favoring a very literal interpretation of my argument you would agree.
but, in the spirit of critical reading education, what i meant is: human attention good, machine ingestion bad.
Digital information may be our first post-scarce resource. It's interesting, and sad, to see so many attempt to fit it within scarcity-based economic models.
> digital information may be our first post-scarce resource
… browses memory and storage prices on NewEgg …
Hmm.
But the word digital is distracting us.
The word information is the important one. The question isn't where information goes. It's where information comes from.
Is new information post scarcity?
Can it ever be?
Bandwidth and compute constraints make websites anything but an endless stream, though.
That's exactly it. It costs me real time and money to serve the 97% of fake traffic that just takes without giving me anything in return.
It’s interesting to see twists on the old anti-piracy arguments recycled for anti-ai.
Turns out many (most?) people on the internet were never anti-copyright in the first place. They were just anti-copyright (or at least, refused to challenge the anti-copyright people) because they wanted free movies and/or hated corporations.
Many of these people live in countries where downloading for personal use is lawful, since they pay a copyright levy exactly to cover that.
They don't have to hate the copyright.
That really depends, but the quick answer is that according to our human social contract, we'd just ask "how many can I take?". Until now, the only real tool to limit scrapers has been throttling, but I don't see any reason for there not to be a similar conversational social contract between machines.
Isn’t robots.txt such a “social contract between machines”? But AI scrapers couldn’t care less.
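For reference, this is roughly the check a well-behaved crawler is expected to make before fetching a page, sketched with Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, the URLs, and the `may_fetch` helper are made up for illustration. Nothing enforces the answer: robots.txt is purely advisory, which is the whole complaint.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely,
# and keep everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules permit this agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `may_fetch("ExampleAIBot", "https://example.com/post")` is `False`, while any other agent may fetch `/post` but not `/private/x`. A scraper that never calls anything like this, or lies about its user agent, is unaffected.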
I will copy the supermarket and paste it somewhere else.
I'm also going to download a car.
> If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?
Depends on the trust level of the society where the store resides.
The internet is a cesspool of vagrants, thieves, the mentally unstable, people and software with no impulse control, and pirates, and that is just talking about corporations. It gets so much worse with individuals.
This is a dishonest analogy. In your example, there is only a limited number of cookies available, whereas there is no practical limit on the number of times a given piece of digital media can be viewed.
You are allowed to take one cookie. But you are allowed to view a public website multiple times if you so want.
[flagged]
> If I can poison them and their families, I will.
Don't post anything online that you don't want to be brought up in court later.
Like the OP's solution it was about scrapers and the models they share their data with.
Wow, how did you manually hand-write 6 million web pages? That is impressive. It would take me a while to even monotonically count that high.
You're trying to use quite unfunny "sarcasm" to move the goalposts to a strawman (they never claimed they handcrafted these pages) and quickly gloss over the fact it's 20 years of work, so why not?
You're ascribing an adversarial attitude to me which is actually held by nobody except yourself. The question was genuine and out of curiosity, and they can answer for themselves, however they choose. From the posting guidelines:
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
I am a friend, not a foe, and so are your other fellow HN posters.
There sure is a limit on the load that the server you're DDoSing can take, and on the willingness of people to post new worthy content in public. The supply is limited, just not at the first degree. Let's make a small edit: Are you allowed to take all the cookies and then sell them with a small ribbon with your name on it?
There is no arguing with pirates. They’ll take what’s yours and forget about you while you tend to the ashes.