Does this break part 4 of the Goodreads TOS?
"[...] you agree not to sell, license, rent, modify, distribute, copy, reproduce, transmit, publicly display, publicly perform, publish, adapt, edit or create derivative works from any materials or content accessible on the Service. Use of the Goodreads Content or materials on the Service for any purpose not expressly permitted by this Agreement is strictly prohibited."
Also, did the reviewers give you permission to feed their content into an LLM?
Fairly meaningless in this day and age. Also IIRC scraping legality depends heavily on jurisdiction. Some places take a more permissive view of accessing publicly available information, even if a site's TOS forbids bots.
In the US there’s a major precedent [0] which held that scraping public-facing pages isn’t a CFAA "unauthorized access" issue. That’s a big part of why we’ve seen entire venture-backed scraping companies pop up - it’s not considered hacking if the data is already public.
[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
From that article:
> However, after further appeal in another court, hiQ was found to be in breach of LinkedIn's terms, and there was a settlement.
So why would the same not apply here?
They settled out of court, that doesn't mean that they were found to be in breach of the terms.
These were some of the notable elements (worth noting that none mention breaching terms of service):
> Damages: Judgment in the amount of $500,000 is entered against hiQ, with all other monetary relief waived.
> CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”
> California “CFAA”: hiQ stipulates that LinkedIn “may establish civil liability” under California’s state-law counterpart to the CFAA based on hiQ’s data collection practices, use of fake accounts and other means to evade detection by LinkedIn, hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts, and hiQ’s unauthorized commercial use of data.
> Trespass: hiQ stipulates that LinkedIn has established judgment as to liability under California law for the common law torts of trespass to chattels and misappropriation.
> Irreparable harm: hiQ stipulates that LinkedIn has established that it has suffered an irreparable injury and that LinkedIn satisfied the remaining factors and is entitled to a permanent injunction.
https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
A settlement means there was no legal ruling and no precedent set. The entire case is legally moot.
In America, you can simply pay to not lose any lawsuit ever, and thus never have to face legal consequence or changes to the law you don't like.
Paying money is the legal consequence in most situations where one loses a civil case. Very rarely is any legal precedent a result of the case even if the court finds the party liable.
And this is the system in most of the world, even if the nomenclature differs between common law countries, civil law countries, and others.
it's only legal if you have a team of lawyers though. the law still applies to the rest of us.
It is the future: I own nothing, and I've never been happier. They can sue me and take nothing.
I've been trying to convince myself I'd be able to live like Diogenes, sleep in the streets, bathe in the sea and just generally survive off scraps - but I think that only works if other people can afford to throw away scraps.
If you live in the west it’s no problem - the amount of waste there is insane
It won't be after the billionaires own everything and the rest are living off scraps already.
> CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”
This was part of the terms of the settlement.
So if you are legally allowed to "adapt, edit or create derivative works from any materials", what's the point of the TOS?
The TOS specify the circumstances in which the corp may take action that is unrelated to the legal system. Just because they can't sue you (and easily win) for scraping, doesn't mean they can't block you if they notice you doing it.
Google for example has a TOS and is well known for permanently banning accounts for real or imagined or AI-generated violations of it. Google banning you for breaking TOS doesn't mean you broke the law, just that you broke their rules, which apparently include a clause against being in the wrong place at the wrong time.
I believe TOS is binding as long as it doesn't conflict with the law. If something is deemed fair use under the law, TOS cannot override those legal rights.
Legal rights are signed away all the time in contracts though.
Aren't there some limits on which legal rights can be signed away through a contract?
I imagine that a contract in which someone agrees to become a slave would be void.
Sure there are limits, but they seem to apply to really egregious things like selling your life or vote. We're talking about signing away the right to fair use, not entering into indentured servitude. People sign non-disparagement agreements all the time and this seems more like that than slavery.
That’s a good question. It also would not be the first time that companies use trickery and manipulation or even deliberately illegal practices for various business/financial reasons. At the very least it could be used as a tool to underpin intimidating lawsuits. A step further, regardless of the legality in the relevant jurisdiction, it could be used to influence official government foreign policy to exert pressure on a jurisdiction that permits scraping.
Tell that to judyrecords with the same smug attitude.
Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.
This is so overly dramatic it’s hard to even consider the point you’re trying to make.
You ok bud? You sound unhinged here. Your post doesn't even make sense in the context of the one you were replying to.
What expectation of confidentiality are you ascribing to people having posted publicly accessible opinions on the internet?
Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?
My expectation isn't of confidentiality, but of attribution. Sure, my website is perfectly accessible on the internet, and I'm fine with being able to find it on google, but if you pipe it into an algorithm that will start throwing out stuff based on what I wrote, with zero reference to me at all, I'd get a bit annoyed. This website has taken the combined output of probably thousands of people, shoved it into an algorithm and is then using their work to give "original" ideas. If one person wanted their content removed from the system, how would you do that?
What does that comment have to do with confidentiality?
That he viewed a review on Goodreads as the reviewer’s intellectual property hadn’t occurred to me. I see why, in aggregate, many such opinions become valuable, but the whole is more than the sum of its parts.
So does it feel to you guys like your comments, say, here in this Hacker News thread should be considered effectively copyrighted as your personal IP?
If so, do you feel the same way about opinions you share out in a supermarket or on the street?
Of course comments are copyrighted, if they happen to contain text that is novel. As an example, in the reddit TOS, they require commenters to license their comments to reddit.
> If so, do you feel the same way about opinions you share out in a supermarket or on the street?
Well, being novel isn't the only criterion for copyright; the work must also be "fixed", and opinions in a supermarket usually aren't (but they can be, if I film them and post on reels or something; then the video itself is copyrighted)
https://copyrightalliance.org/education/copyright-law-explai...
> Fixation
> To meet the fixation requirement, a work of authorship must be fixed in a tangible medium of expression. Protection attaches automatically to an eligible work the moment the work is fixed. A work is considered to be fixed so long as it is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.
There are well established legal standards for what is copyrightable and I believe written literary criticism trivially qualifies (as it should). Stuff you yell at the supermarket doesn't, IIUC, as it isn't fixed in a tangible form. Social media comments are, IIUC, generally protected. The exception would be comments that don't meet the bar to be considered "original", "creative", etc.
(not a lawyer)
Technically speaking, none of Goodreads' material or content is being used publicly; the only information displayed on the site is freely available (Title, Author) and not Goodreads' property.
You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service", but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.
It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"
I'd be impressed if a youtuber could read 3 billion reviews and recommend books to you based on that
What about a youtuber who builds a machine that scrapes 3 billion books and makes recommendations based on the data?
Skip that step. This project enables a Youtuber that automates pulling related booklists from this site, and uses AI to make the recommendation videos. Thousands of videos.
I visit your garden and take 1 apple from your tree
I visit your garden and take 1000 apples from your tree.
Not that different.
Not only am I taking 1,000 apples, but I use those 1,000 apples to start my own orchard and encourage people to come to it instead of yours.
Yeah but if I program a drone swarm to automate this process it’s for the greater good — more apples for everyone!
And I only charge a tiny subscription for access to all my drone-managed orchards, you can eat as many apples as you want. But don’t steal any and start your own orchard or I sue.
All the people who care for the trees and pick the apples have lost their jobs, and while an apple has become nearly worthless, without a job it's still unaffordable.
Replace your drones with China or India and you have the current situation in the US.
Apple farmers go out of business so you lose the people who create new varieties.
> but I use those 1,000 apples to start my own orchard
Steal cuttings, not the fruit, if you plan to start an orchard. From 1000 apples you'll get ~10 000 seeds, statistically you won't even end up with one good tree.
> An output of three cultivars from around 50.000 seeds means that 17.000 seeds were needed to get one cultivar. Only one out of around 9.000 scab resistant seedlings showed the appropriate quality to become a cultivar. This proportion underlines the enormous effort which is necessary to develop a new cultivar.
https://orgprints.org/id/eprint/13698/1/220-225.pdf
... and somehow your garden did not lose any apple in the process.
But you are an apple seller
Not a great analogy, since a digital copy leaves the original intact unlike your apples
For every apple I take, you still have your apple on the tree, because my apple is only a copy of yours.
At what point are they feeding reviews into an LLM? From what I got the only personal data they're using is which user read which books.
I’m not taking sides in this debate, but since feeding whole books into LLMs is considered legal fair use now, I guess these reviews don’t require permission either. It would be great to hear a lawyer’s take on this.
The hidden gotcha in the Anthropic judgement (which I think is what you’re referencing?) is that feeding whole books into LLMs is considered legal fair use if you obtain them legitimately.
I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.
My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.
But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.
Obviously user generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with Goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.
So from a ‘moral arguments’ perspective I don’t think scraping Goodreads is as problematic as other scraping examples.
(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)
Goodreads offers those reviews up publicly by serving them from their webservers to anyone who asks for them.
Sorry, I don’t understand the point you’re making. I know that these are publicly available - the point I was making, drawing off the parent comment, is that where it has been deemed fair use in copyright to use books to train LLMs when the content has been legitimately obtained then a similar assessment might apply for this sort of ingestion.
If content is publicly available that does not necessarily mean it’s free of copyright control: the justification for using the reviews to train an LLM would be based on the fact that fair use means it is not an infringement of copyright. But if the publisher has terms that forbid scraping then the fair use argument may be undermined, if that argument is predicated on the content being legitimately obtained. I’m not a lawyer but it’s quite easy to see how “books can be used for LLM training under fair use but not if you pirate them” extends to “content on the web can be used for LLM training under fair use but not if you’ve breached the terms set out by the publisher”.
This is, essentially, why I've withdrawn from posting content from my human brain almost anywhere on the open internet (except here, sometimes) and have retired blog posts, opinions, and so on to our friends' WAN.
Why ask questions you already know the answers to?
Because some tech adjacent people still have morals?
If it's on the internet, and people can access it, then it's public. I would have no expectations for what people do with public data; that just seems like setting yourself up for disappointment.
Is a pirated movie, found on bittorrent, public?
IMO, your definition is overbroad
If it's on bittorrent then, yes, it's public. It doesn't matter if you intended it to be or not, it's publicly accessible, therefore it's public.