Precisely. Nothing is truly original. To talk as though there's some abstract ownership over even an observation of a thing, one that forces people to pay rent to use it... well, artists certainly don't pay whoever invented perspective drawing, and programmers don't pay a programming language's creator. People don't pay Newton and his descendants for making something that makes use of gravity. Copyright has always been counterproductive in many ways.
To go into details though: copyright law has a "fair use" exception with a "transformative" criterion. This is what allows things like satire and reaction videos to exist. So long as you don't replicate the original 1-to-1 in product and purpose, IMO it qualifies as tasteful use.
What the fuck? People also need to pay to access that creative work if the rights owner charges for it, and they too are committing an illegal act if they don't. The LLM makers are doing this illegal act billions of times over, for something approximating all creative work in existence. I'm not arguing that creatives make things in a vacuum; that's completely beside the point.
I've never heard of anything like what you are talking about. There isn't a charge for using tropes, plot points, character designs, etc. from other people's works if they are sufficiently changed.
If an LLM reads a free Wikipedia article on Aladdin and adds a genie to its story, what copyright law do you think has been broken?
Meta and Anthropic, at least, fed entire copyrighted books into training. Not the Wikipedia page, not a plot summary or some tropes: they fed the entire original book into training. They used at least the entirety of LibGen, which is a pirated dataset of books.
I have no interest in the rest of this argument, but I think I take a bit of issue with this particular point. I don't think the law is fully settled on this in any jurisdiction, and certainly not in the United States.
"Reason" is a more nebulous term; I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive that wasn't at least partially inspired by the works of Disney, but I don't think that implies that somehow all animations are "stolen" from Disney just because of that fact.
Where you draw the line on this is obviously subjective, and I've gone back and forth, but I find it really annoying that everyone is acting like this is so clear cut. Evil corporations like Disney have been trying to use this logic for decades to abuse copyright and outlaw being inspired by anything.
It can be based on reason and law without being clear cut - that situation applies to most of reason and law.
> I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive that wasn't at least partially inspired by the works of Disney ...
Sure, but you can reason about it, such as by using analogies.
What makes something more or less ideological for you in this context? Is "reason" always opposed to ideology for you? What is the ideology at play here for the critics?
I always wonder if the general sentiment toward genai would be positive if we had wealth redistribution mechanisms in place, so everyone would benefit. Obviously that's not the case, but if you consider the theoretical, do you think your view would be different?
To be honest, I'm not even sure I'm fully on board with the labor theft argument. But I certainly don't think generative AI is such an unambiguous boon for humankind that we should ignore any possible negative externalities just to advance it.
> "To someone who believes that AI training data is built on the theft of people's labor..."
i.e. people who are not hackers. Many (most?) hackers have been against the idea of copyright and intellectual property from the beginning. "Information wants to be free," after all.
Must be galling for people to find themselves on the same side as Bill Gates and his Open Letter to Hobbyists in 1976, which was also about "theft of people's labor".
It's not free. There is a license attached, one you are supposed to follow, and not following it is against the law.
I'm not whining in this case, just pointing out "they gave it out for free" is completely false, at the very least for the GNU types. It was always meant to come with plenty of strings attached, and when those strings were dodged new strings were added (GPL3, AGPL).
If I had a photographic memory and I used it to replicate parts of GPLed software verbatim while erasing the license, I could not excuse it in court that I simply "learned from" the examples.
Some companies outright bar their employees from reading GPLed code because they see it as too high of a liability. But if a computer does it, then suddenly it is a-ok. Apparently according to the courts too.
If you're going to allow copyright laundering, at least allow it for both humans and computers. It's only fair.
> If I had a photographic memory and I used it to replicate parts of GPLed software verbatim while erasing the license, I could not excuse it in court that I simply "learned from" the examples.
Right, because you would have done more than learning, you would have then gone past learning and used that learning to reproduce the work.
It works exactly the same for an LLM. Training the model on content you have legal access to is fine. Afterwards, someone using that model to produce a replica of that content is engaged in copyright infringement.
You seem set on conflating the act of learning with the act of reproduction. You are allowed to learn from copyrighted works you have legal access to, you just aren't allowed to duplicate those works.
The problem is that it's not the user of the LLM doing the reproduction, the LLM provider is. The tokens the LLM is spitting out are coming from the LLM provider. It is the provider that is reproducing the code.
If someone hires me to write some code, and I give them GPLed code (without telling them it is GPLed), I'm the one who broke the license, not them.
> The problem is that it's not the user of the LLM doing the reproduction, the LLM provider is.
I don't think this is legally true. The law isn't fully settled here, but things seem to be moving towards the LLM user being the holder of the copyright of any work produced by that user prompting the LLM. It seems like this would also place the infringement onus on the user, not the provider.
> If someone hires me to write some code, and I give them GPLed code (without telling them it is GPLed), I'm the one who broke the license, not them.
If you produce code using an LLM, you (probably) own the copyright. If that code is already GPL'd, you would be the one engaged in infringement.
> You seem set on conflating "training" an LLM with "learning" by a human.
"Learning" is an established word for this, happy to stick with "training" if that helps your comprehension.
> LLMs don't "learn" but they _do_ in some cases, faithfully regurgitate what they have been trained on.
> Legally, we call that "making a copy."
Yes, when you use an LLM to make a copy... that is making a copy.
When you train an LLM... that isn't making a copy, that is training. No copy is created until output is generated that contains a copy.
Everything which is able to learn is also alive, and we don't want to start treating digital devices and software as living beings.
If we are saying that the LLM learns things and then made the copy, then the LLM committed the crime and should receive the legal punishment: be sent to jail, banned from society until it is deemed safe to return. It is not as if each installed copy is some child spawned from digital DNA, so that the parent continues to roam while the child gets sent to jail. If we are to treat it like a living being that learns things, then every copy and every version is part of the same individual, and thus the whole individual gets sent to jail. No new individual is created when it is installed on a new device.
> we don't want to start to treat digital device and software as living beings.
Right, because then we have to decide at what point our use of AI becomes slavery.
You both broke the site guidelines badly in this thread. Could you please review https://news.ycombinator.com/newsguidelines.html and stick to the rules? We ban accounts that won't, and I don't want to ban either of you.
You both broke the site guidelines badly in this thread. Could you please review https://news.ycombinator.com/newsguidelines.html and stick to the rules? We ban accounts that won't, and I don't want to ban either of you.
I respond politely to being repeatedly called names, and this is your response?
If you think my behavior here was truly ban-worthy then do it, because I don't see anything in it I would change, except for engaging at all.
This is the sort of thing I was referring to:
> Instead of bothering to read and understand you have continued to call names.
> You seemed confused, you still seem confused
> your pointless semantic nitpick
> you need to get some more real world experience
I wouldn't personally call that being polite, but whatever we call it, it's certainly against HN's rules, and that's what matters.
Edit: This may or may not be helpful (probably not!) but I wonder if you might be experiencing the "objects in the mirror are closer than they appear" phenomenon that shows up pretty often on the internet - that is, we tend to underestimate the provocation in our own comments, and overestimate the provocation in others' comments, which in the end produces quite a skew (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...).
Sorry, and thanks.
I know moderation is a tough gig.
We spread free software for multiple purposes, one of them being the free software ethos. People using that for training proprietary models is antithetical to such ideas.
It's also an interesting double standard, wherein if I were to steal OpenAI's models, no AI worshippers would have any issue condemning my action, but when a large company clearly violates the license terms of free software, you give them a pass.
> I were to steal OpenAI's models, no AI worshippers would have any issue condemning my action
If GPT-5 were "open sourced", I don't think the vast majority of AI users would seriously object.
OpenAI got really pissy about DeepSeek using other LLMs to train though.
Which is funny since that's a much clearer case of "learning from" than outright compressing all open source code into a giant pile of weights by learning a low-dimensional probability distribution of token sequences.
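To make "learning a probability distribution of token sequences" concrete, here is a toy bigram counter. Real LLMs fit neural networks over far longer contexts, so this is purely an illustrative sketch, not how any production model works:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    # Count how often each token follows each other token,
    # then normalize the counts into conditional probabilities.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(followers.values()) for tok, n in followers.items()}
        for prev, followers in counts.items()
    }

model = train_bigram("the cat sat on the mat".split())
# "the" is followed by "cat" once and "mat" once, so each gets probability 0.5
```

The point of contention is that, with enough parameters and enough passes over the data, such a distribution can concentrate so much probability on specific long sequences that sampling from it reproduces training text verbatim.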
I can't speak for anyone else, but if you were to leak weights for OpenAI's frontier models, I'd offer to hug you and donate money to you.
Information wants to be free.
> The difference is that people who write open source code or release art publicly on the internet from their comfortable air conditioned offices voluntarily chose to give away their work for free
That is not nearly the extent of AI training data (e.g. OpenAI training its image models on Studio Ghibli art). But if by "gave their work away for free" you mean "allowed others to make [proprietary] derivative works", then that is in many cases simply not true (e.g. GPL software, or artists who publish work protected by copyright).
What? Over 183K books were pirated by these big tech companies to train their models. They knew what they were doing was wrong.
Perhaps you should Google the definition of metaphor before commenting.
You're changing the subject. What about the actual point?
I mean, yeah, if you omit any objectionable detail and describe it in the most generic possible terms then of course the comparison sounds tasteless and offensive. Consider that collecting child pornography is also "storing the result of an HTTP GET".
If you believe my conduct here is inappropriate, feel free to alert the mods. I think it's pretty obvious why describing someone's objections to AI training data as "storing the result of an HTTP GET" is not a good faith engagement.
We've banned this account. Please don't use multiple accounts in arguments on HN. It will eventually get your main account banned as well.
https://news.ycombinator.com/newsguidelines.html
The objection to CSAM is rooted in how it is (inhumanely) produced; people are not merely objecting to a GET request.
Yes, they're objecting to people training on data they don't have the right to, not just the GET request as you suggest.
If you distribute child porn, that is a crime. But if you crawl every image on the web and then train a model that can then synthesize child porn, the current legal model apparently has no concept of this and it is treated completely differently.
Generally, I am more interested in how this affects copyright. These AI companies just have free rein to convert copyrighted works into the public domain through the proxy of over-trained AI models. If you release something as GPL, they can strip the license; but the same is not true of closed-source code, which isn't trained on.
Indeed, and neither is that what people are objecting to with regard to AI training data.
That's not true, since cartoon drawings and certain manga also fall in that category. Do you have any evidence that manga is produced inhumanely?
> believes that AI training data is built on the theft of people's labor
I mean, this is an ideological point. It's not based in reason, won't be changed by reason, and is really only a signal to end the engagement with the other party. There's no way to address the point other than agreeing with them, which doesn't make for much of a debate.
> an 1800s plantation owner saying "can you imagine trying to explain to someone 100 years from now we tried to stop slavery because of civil rights"
I understand this is just an analogy, but for others: people who genuinely compare AI training data to slavery will have their opinions discarded immediately.
We have clear evidence that millions of copyrighted books have been used as training data because LLMs can reproduce sections from them verbatim (and emails from employees literally admitting to scraping the data). We have evidence of LLMs reproducing code from github that was never ever released with a license that would permit their use. We know this is illegal. What about any of this is ideological and unreasonable? It's a CRYSTAL CLEAR violation of the law and everyone just shrugs it off because technology or some shit.
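(For what it's worth, verbatim reproduction of this kind is mechanically checkable. A crude sketch, assuming nothing about how any lab actually audits its models, is a word n-gram overlap test between a source text and a model's output:)

```python
def shared_ngrams(source, output, n=8):
    # Collect every run of n consecutive words in each text,
    # then intersect: long shared runs suggest verbatim copying.
    def grams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams(source) & grams(output)

hits = shared_ngrams(
    "it was the best of times it was the worst of times",
    "the model wrote: it was the best of times indeed",
    n=4,
)
# hits contains shared 4-word runs such as ("it", "was", "the", "best")
```

The eight-word default is arbitrary; the longer the shared run, the harder it is to explain as coincidence rather than memorization.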
You keep conflating different things.
> We have evidence of LLMs reproducing code from github that was never ever released with a license that would permit their use. We know this is illegal.
What is illegal about it? You are allowed to read and learn from publicly available unlicensed code. If you use that learning to produce a copy of those works, that is infringement.
Meta clearly engaged in copyright infringement when they torrented books that they hadn't purchased. That is infringement already, before they started training on the data. That doesn't make the training itself infringement though.
> Meta clearly engaged in copyright infringement when they torrented books that they hadn't purchased. That is infringement already, before they started training on the data. That doesn't make the training itself infringement though.
What kind of bullshit argument is this? Really? Works created using illegally obtained copyrighted material are themselves considered to be infringing as well. It's called derivative infringement. This is both common sense and law. Even if not, you agree that they infringed on the copyright of something close to all copyrighted works on the internet, and this sounds fine to you? The consequences and fines from that would kill any company if they actually had to face them.
> What kind of bullshit argument is this? Really? Works created using illegally obtained copyrighted material are themselves considered to be infringing as well.
That isn't true.
The copyright to derivative works is owned by the copyright holder of the original work. However, using illegally obtained copies to create a fair use transformative work does not taint your copyright of that work.
> Even if not, you agree that they infringed on copyright of something close to all copyrighted works on the internet and this sounds fine to you?
I agree that they violated copyright when they torrented books and scholarly articles. I don't think that counts as "close to all copyrighted works on the Internet".
> The consequences and fines from that would kill any company if they actually had to face them.
I don't actually agree that copyright infringement that causes no harm should be met with such steep penalties. I didn't agree when it was the RIAA doing it, and even though I don't like Facebook, I don't like it here either.
>We know this is illegal
>It's a CRYSTAL CLEAR violation of the law
in the court of reddit's public opinion, perhaps.
there is, as far as I can tell, no definite ruling about whether training is a copyright violation.
and even if there was, US law is not global law. China, notably, doesn't give a flying fuck. kill American AI companies and you will hand the market over to China. that is why "everyone just shrugs it off".
"China will win the AI race if we in the West (America) don't" is an excuse created by the very people who started the race in Silicon Valley. It's like America saying it had to win the nuclear arms race, when physicists like Oppenheimer in the late 1940s wanted to prevent one once they understood the consequences.
okay, and?
what do you picture happening if Western AI companies cease to operate tomorrow and fire all their researchers and engineers?
Less slop
China is doing human gene editing and embryo cloning too, we should get right on that. They're harvesting organs from a captive population too, we should do that as well otherwise we might fall behind on transplants & all the money & science involved with that. Lots of countries have drafts and mandatory military service too. This is the zero-morality darwinian view, all is fair in competition. In this view, any stealing that China or anyone does is perfectly fine too because they too need to compete with the US.
All creative types train on other creatives' work. People don't create award-winning novels or art pieces from scratch. They steal ideas and concepts from other people's work.
The idea that they come up with all this stuff from scratch is public-relations BS. Like Arnold Schwarzenegger claiming he never took steroids: only believable if you know nothing about bodybuilding.
The central difference is scale.
If a person "trains" on other creatives' works, they can produce output at the rate of one person. This presents a natural ceiling for the potential impact on those creatives' works, both regarding the amount of competing works, and the number of creatives whose works are impacted (since one person can't "train" on the output of all creatives).
That's not the case with AI models. They can be infinitely replicated AND trained on the output of all creatives. A comparable situation isn't one human learning from another human; it's millions of humans learning from every human. Only those humans don't even have to get paid: all their payment is funneled upwards.
It's not one artist vs. another artist, it's one artist against an army of infinitely replicable artists.
So this essentially boils down to an efficiency argument, and honestly it doesn't really address the core issue of whether it's 'stealing' or not.
What kind of creative types exist outside of living organisms? People can create award-winning novels, but a table does not. Water does not. A paper with some math does not.
What is the basis that an LLM should be included as a "creative type"?
Well a creative type can be defined as an entity that takes other people's work, recombines it and then hides their sources.
LLMs seem to match.
Precisely. Nothing is truly original. To talk as though there's an abstract ownership over even an observation of a thing, one that forces people to pay rent to use it... well, artists definitely don't pay whoever invented perspective drawing, and programmers don't pay their programming language's creator. People don't pay Newton and his descendants for making something that makes use of gravity. Copyright has always been counterproductive in many ways.
To go into details though, copyright law has a "fair use" exception with a "transformative" criterion. This allows things like satire and reaction videos to exist. So long as you don't replicate 1-to-1 in product and purpose, IMO it qualifies as tasteful use.
What the fuck? People also need to pay to access that creative work if the rights owner charges for it, and they are also committing an illegal act if they don't. The LLM makers are doing this illegal act billions of times over, for something approximating all creative work in existence. I'm not arguing that creatives make things in a vacuum; that is completely beside the point.
I've never heard anything about what you are talking about. There isn't a charge for using tropes, plot points, character designs, etc. from other people's works if they are sufficiently changed.
If an LLM reads a free Wikipedia article on Aladdin and adds a genie to its story, what copyright law do you think has been broken?
Meta and Anthropic at least fed entire copyrighted books into training. Not the Wikipedia page, not a plot summary or some tropes: they fed the entire original book into training. They used at least the entirety of LibGen, which is a pirated dataset of books.
[flagged]
> It's very much based on reason and law.
I have no interest in the rest of this argument, but I think I take a bit of issue on this particular point. I don't think the law is fully settled on this in any jurisdiction, but certainly not in the United States.
"Reason" is a more nebulous term; I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive that wasn't at least partially inspired by the works of Disney, but I don't think that implies that somehow all animations are "stolen" from Disney just because of that fact.
Where you draw the line on this is obviously subjective, and I've gone back and forth, but I find it really annoying that everyone is acting like this is so clear cut. Evil corporations like Disney have been trying to use this logic for decades to abuse copyright and outlaw being inspired by anything.
It can be based on reason and law without being clear cut; that situation applies to most of reason and law.
> I don't think that training data is inherently "theft", any more than inspiration would be even before generative AI. There's probably not an animator alive that wasn't at least partially inspired by the works of Disney ...
Sure, but you can reason about it, such as by using analogies.
[flagged]
What makes something more or less ideological for you in this context? Is "reason" always opposed to ideology for you? What is the ideology at play here for the critics?
> I mean, this is an ideological point. It's not based in reason
You can't be serious.