To be very clear on this point - this is not related to model training.

It’s important in the fair use assessment to understand that the training itself is fair use, but the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into in acquiring the training data.

Buying used copies of books, scanning them, and training on it is fine.

Rainbows End was prescient in many ways.

> Buying used copies of books, scanning them, and training on it is fine.

But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends not to be anywhere close to enough to have a deterrent effect in the future.

That is like saying Uber would have not had any problems if they just entered into a licensing contract with taxi medallion holders. It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up their models with un-DRM'd PDFs and ePUBs from wherever instead of licensing them publisher by publisher.

> It was faster to just put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation

And thank god they did. There was no perfectly legal channel to fix the taxi cartel. Now you don't even have to use Uber in many of these places because taxis had to compete - they otherwise never would have stopped pulling the "credit card reader is broken" scam or taking long routes on purpose, and never would have started using tech that made them more accountable for these things and made it harder for them to racially profile passengers. (They would infamously pretend not to see you if they didn't want to give you service back when you had to hail them with an IRL gesture instead of an app.)

I don't know that it's such a great thing in the end. Uber/Lyft is 50-100% more expensive now than taxis were before. They're entrenched in different ways.

Idk how it is in the US but in eastern Europe that's only true if surge is on and even so considering how shitty the quality of service was before Uber it's fine.

And it’s still shitty. Uber/Bolt is like on par with 90s taxis. At least here there was a short attempt to make things better in early 2010s with nicer cars and trying to force drivers to be nicer. But then it was „disrupted“.

I far, far, far prefer Uber (or Lyft, in the US) wherever I am, over whatever local taxi service there is. Yes, the quality of cars varies a lot. Yes, you never know if you're going to get a quiet driver or a way-too-talkative one.

But I know what I'm going to pay up-front, can always pay with a credit card (which happens automatically without annoying post-trip payment), the ride is fully tracked, and I can report issues with the driver that I have an expectation will actually be acted upon. And when I'm in another country where there are known to be taxis that scam foreigners, Uber is a godsend.

Yes, pre-Uber taxis were expensive and crappy, and even if Uber is expensive now, it's not crappy; it's actually worth the price. And I'm not convinced Uber is even that expensive. We always forget to account for inflation... sure, we might today say, "that ride in a taxi used to cost $14, but in an Uber it costs $18". But that ride in a taxi was 15 years ago.
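To put rough numbers on the inflation point, here is a minimal sketch. The $14 and $18 figures and the 15-year gap come from the comment above; the constant 2.5% annual rate is purely an assumption for illustration, not an official CPI figure, and the function names are made up for this example.

```python
def adjust_for_inflation(price, annual_rate, years):
    """Compound a past price forward at a constant annual inflation rate."""
    return price * (1 + annual_rate) ** years


def break_even_rate(old_price, new_price, years):
    """Annual inflation rate at which old_price grows to exactly new_price."""
    return (new_price / old_price) ** (1 / years) - 1


taxi_then = 14.00      # taxi fare 15 years ago (from the comment)
uber_now = 18.00       # Uber fare today (from the comment)
years = 15
assumed_rate = 0.025   # ASSUMPTION: 2.5% per year, illustrative only

taxi_in_todays_dollars = adjust_for_inflation(taxi_then, assumed_rate, years)
needed_rate = break_even_rate(taxi_then, uber_now, years)

print(f"Old taxi fare in today's dollars: ${taxi_in_todays_dollars:.2f}")
print(f"Inflation rate at which $14 becomes $18: {needed_rate:.2%} per year")
```

Under that assumed rate, the old taxi fare comes out above the current Uber fare in today's dollars. More generally, any average inflation above the break-even rate printed on the second line means the $18 ride is cheaper in real terms than the $14 one was.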

If you think that Uber deal was not thought out well it didn't get a chance when the drivers became AI autonomous and hacked the old drivers out of car. Then as they threw out the driver police hit their lights and let the AI go with a warning for throwing trash out but us passengers got raped by a feeldown seizure of mess to ride with the crazy driver not paid to controlled substances felony party too other passengers. So in response to nontraffic citation and no evidence because self incriminating with forced treatment to more meds but not allowed? What a choice red pill or six blue pills all illegal to have.

The disruption worked in most cities I use uber in. It’s far more trustworthy to use uber.

Uber did a great job convincing lay people that taxis were ripoffs and they were a good deal. For some time that was probably true.

Now, I see people at the airport walk over to the pickup lot, joining a crowd of others furiously messing with their phones while scanning the area for presumably their driver.

All the while the taxis waiting immediately outside the exit door were $2 more expensive, last time I checked.

Uber didn't have to convince anyone; taxis were ripoffs. It didn't even always have to do with money. Taxis asking people where they were going and driving off if it wasn't far enough was a significant issue. So was taxis not picking up black people. Many taxis in my town were dirty, and the drivers were jerks or creepy or both. With protections built into law and no competition, the industry didn't have to even try to cater to the customer.

The taxi industry sealed its own death warrant a long time ago. Ride sharing services solved a real problem at the right time. If that cost a bit more, it was well worth it. I won't take a taxi now unless I am forced to.

In NYC, Vegas, and a few other places I take taxis because they're dense and work well there.

Uber was a godsend for everyone living outside of like 4 metro areas in the US.

It helped that they started in places like San Francisco, where the taxi cartel was so absurdly terrible that you'd win fans just by showing up.

I lived in SF when Uber started. We used to call Veteran's Cab because they were the only company that wouldn't ditch on the way to pick you up, but it was completely normal to wait more than an hour for a cab in the dark hinterlands of 24th and Dolores or the industrial wasteland of 2nd and Folsom. An hour during which you had to be ready to jump as soon as the car arrived. Everybody had at least one black-car driver's cell number for downtown use because if they happened to be free, you could at least get picked up.

Uber would have had a religious following of fanpersons even if all they'd done was an estimated pickup time that was accurate to within 20 minutes.

Where I am, the taxi from the airport is about $5 more expensive during off peak, but it can be $20 cheaper during peak hours. I always take the taxi since it's right there, but I usually check the price on Lyft or Uber just to compare.

I know how much my ride will be and I know it doesn't vary based on what happens along the way.

That's funny - ride fares change, and only in an Uber have I been kicked out of the car "because the app crashed" in the middle of an abandoned road, or had a very intoxicated person pick me up, or try to drive recklessly in hazardous conditions.

I happily pay a premium for none of these things again.

Depending on the country, they are paid more fairly as well, are insured etc.

The Taxi mafia had to go but Uber and co. are still questionable benefactors.

Not at any airport I've been to recently. I've never seen lines of taxis waiting at any airport in the last few years. There are empty taxi slots. People hail the taxi using an app and then wait for it to show up. Just like Lyft/Uber.

I mean, that seems pretty unfair, no, giving one set of transportation companies an arbitrary advantage over another? This sort of thing is exactly why Uber started in the first place: because taxis had unfair monopolistic advantages for no particular reason, and gave customers a poor experience, because they knew they didn't have to do better to keep their jobs.

I have no idea what I'm going to get with those taxis waiting immediately outside the exit door. Even in my home country, at the airport next to my city, I have no idea. I know exactly what I'm getting with an Uber/Lyft, every time. That's valuable to me.

I was just in another country a couple months ago, and when trying to leave the airport, I was confused where I'd need to go in order to get an Uber. I foolishly gave up and went for one of those "conveniently-waiting" taxis, where I was quoted a price up-front, in my home currency, that I later (after doing the currency conversion on the Uber price) realized was a ripoff. The driver also aggressively tried to get me to instead rent his "friend's car" rather than take me to the rental car place like I asked. And honestly I consider that lucky: he didn't try to kidnap me or threaten me in any way, but I was tense during the whole ride, wondering if something bad was going to happen.

That sort of thing isn't an anomaly; it happens all the time to tourists in many countries.

There are many schemes nowadays on Uber cars. I know some stories in developing countries where people are robbed and even killed because they foolishly think that by getting a Uber this means a safe ride. In some countries a regular taxi is actually better regulated and safer than Uber.

In my home country (New York) the taxi mafia was harsh and cruel, but they always did a good job.

> unfair monopolistic advantages for no particular reason

Is that true?

In the US, as well.

I won't recount what recently happened to a friend in Milwaukee. It was an unpopular story (because the ripoff was Uber-based, and not the traditional taxi).

There's bad actors in every industry. I have found that industries that get "entrenched," tend to breed the most bad actors.

If anything turns into a "pseudo-monopoly," expect the grifters to start popping up. They'll figure out how to game the system.

In India, most taxis I ran across at the airport were 50% more expensive - after haggling!

Did you remember to factor in well over 30% inflation in America in the past 5 years plus Uber Lyft initially losing money on rides to capture market share before they eventually had to actually breakeven?

> plus Uber Lyft initially losing money on rides to capture market share before they eventually had to actually breakeven?

That's typically considered to be somewhere between assholish and straight up illegal in most civilized economies.

What law is it breaking?

https://en.wikipedia.org/wiki/Predatory_pricing#Legal_aspect...

In all those countries what’s illegal is abuse of a monopoly, which is not what’s being discussed here. The parent cited Uber and Lyft when they first started. Nothing is illegal about startups undercutting established competitors.

No you’re missing the point.

They acquired market power by killing them through predatory pricing, leaving incumbents unprofitable and forcing them to exit - while creating a steep barrier to entry for any newcomers and strategically manipulating existing riders by offering high take rates initially and subsidising rides to create artificial demand and inflate market share. Then, once they had kicked out the incumbents, they exercised their market power to raise prices and their % of the take rate of each transaction, leaving consumers and riders worse off.

We can talk all day about the nice UX blah blah. But the reality is, financially, they could not have succeeded without a very dubious and unethical approach.

I get why we look on Uber with disdain today. They're the big rich behemoths who treat drivers poorly, previously had a CEO who was a raging asshole, and have now raised their prices (gasp!) to a level that they need to be for a sustainable business.

But I remember when I started using Uber back in 2012. It was amazing compared to every single other option out there. Yes, they entered the market in questionably-legal or often probably outright illegal ways. But illegal is not the same thing as immoral. And I don't think it's unethical to force out competition when that competition is a lazy, shitty, legally-enforced monopoly that treats its customers poorly.

Yes ... THAT was when governments should have stepped in and prevented uber from undercutting taxi drivers with investor money.

As pointed out here, many governments have laws stating that they will step in ... and they didn't.

Do you feel like the taxi medallion system was a better regulatory mechanism than what is currently in place?

https://en.wikipedia.org/wiki/Taxi_medallion

> But illegal is not the same thing as immoral.

Creating the gig economy doesn't get any moral points from me.

Okay but is that illegal?

I can only speak in EU terms in any more detail here, but the EU laws are based on "dominant market position". Monopoly is one route to that but it's not the only route and there is no minimum market share required, as e.g. Qualcomm found out (https://www.cliffordchance.com/insights/resources/blogs/talk...)

Which EU country reacted against Uber's predatory pricing when it was actually happening? Ie. which EU government refused investor money flowing into their economy? The only examples I can find are a few cities, and some of those are in the US. No EU state did, unless I'm missing something.

Sure now that it costs them money, they're reacting, making things worse for literally everyone: the taxi drivers, who've been victimized by the governments not reacting when they should. The customers, who are now paying more. The Uber drivers, who are certainly not the ones getting the money.

A great lawyer will tell you laws don't matter if they're not applied, and then tell you how laws are applied and what you can and can't get away with (this is a necessity since most laws aren't very clear at all, especially when it comes to actual real-world cases or penalties). The EU are absolute masters of that. The famous GDPR, for example, isn't protecting anyone's data in any way that matters, since governments have the power to grant themselves exceptions to it. Which led to all the things the GDPR tried to avoid: insurers (who are mostly part of governments in the EU) getting private medical data, private medical data being used by the police or in court, just to give some examples.

Hell, it's now been confirmed every 2 years or so since 2015 that essentially all European countries think all of the FANGs are abusing their market position. Google, Facebook, Amazon, Apple, ... they've given them billions of dollars in fines. Tell me, what has been fixed? US advertising companies are deeper entrenched than ever before (even outside of the internet, ie. ClearChannel). Law is supposed to fix the problems. Well, obviously the problem of US companies' dominance is not solved, in fact it's gotten a lot worse.

And this is nothing new. Take what EU countries signed in the Budapest memorandum. You will find that it states that if Russia ("any of the ... blabla", which includes Russia) takes Crimea a bunch of EU countries (France, UK) would, first, declare war on the country that did it (Russia) and initiate actual hostile action against that country (ie. not just support to Ukraine). That meant they agreed to have UK and French (and ...) soldiers attack Russia. That was the security guarantee Ukraine had, and that was an international treaty, which in the EU (look it up) has the power of law.

As everyone and their grandmother's cat knows, they didn't actually follow through. They "gave support". That's just one, at the moment important, example.

And of course, the effect is the same: it became worse and worse. Russia's actions became worse and worse and worse. Now the EU countries have given the same guarantees for countries like Poland, Latvia and even Estonia, either directly or through NATO. Will Russia attack? Why not? It's not like these countries will (or let's be real: can) actually fight under any circumstance.

A couple of EU countries' bans on Uber seem to date from 2015-2019, which is slow, but still fairly early relative to worldwide adoption per https://dig.watch/trends/uber

Example ban in Finland: https://www.uber.com/fi-/blog/uberpop-tauolle/

After a few years of operation, the government realised it was serious and pressured Uber to stop its « Uber pop » taxi operations until the disruption in legislation got through.

I used Uber from the first year it was here. As the service got popular with young adults, people took notice, and public debate began, the police were instructed to fine Ubers. Then the drivers asked us passengers to sit up front and pretend we were friends. (Not sure if the app had instructions related to this or not.) Once the legislation change was clear, they officially closed operations for a brief period, as stated in the article.

I just thought it was exciting at the time..

Page not found ...

And Uber is available in Finland: https://www.uber.com/global/en/r/finland/cities/

For what it is worth, what Wikipedia says about the document you mention is not what this comment claims. Personally, I find that comment to be spreading disinformation.

No country gave guarantees, only assurances, and it is even highlighted that the US Senate would never have voted for it favourably, and thus it never was a treaty.

On the other hand, breaking these assurances will guarantee that no other country will ever give up its nuclear arsenal, which is of course no consolation prize for Ukraine. Guarantees in NATO, which is indeed a ratified treaty covering Poland, Latvia and Estonia, would be stronger, but of course I would not put all my eggs in that basket.

> Which EU country reacted against Uber's predatory pricing when it was actually happening?

Bulgaria kicked out Uber for not obeying taxi regulations.

Sounds unrelated? Well it used to be a socialist dictatorship and laws are still written in a ham-fisted-yet-vague* way so that (1) you can't realistically obey them and (2) they can be used against anyone state authorities (or their friends) don't like.

So what's the actual reason? Uber was on its way to price taxi companies out of the market by offering better service at a price of €0.25/km.

* If you're from a developed country and this sounds like what your government is currently doing, you should start panicking.

I can't find news on that, and Uber is available in Bulgaria:

https://www.uber.com/bg/en/

I believe the equivalent for international trade is called "dumping" and is somewhat regulated, although that doesn't apply to Uber.

How much of this is inflation?

Gas is priced lower when counting for inflation, isn’t it?

But drivers got to eat

Why don't they just order it on food delivery. I heard that it massively cut margins on the greedy restaurants, so can't be inflation there...

who cares about the drivers....

Especially because 10 years from now they will progressively get replaced by AI like Waymo, so no point in making sure they are happy in the long-term

Gas is such a small part of the cost

As far as I was aware taxis were an imagined thing we saw in movies. I understand you could call a number and ask for a ride to the airport though they were never useful.

They're always more common in metro areas of the US. You must be from a relatively rural area and don't get out of it much.

That said, uh, the use of getting a taxi to drive you to or from the airport was just not having to park at the airport which generally costs a lot of money, and in certain areas is a little sketchy on whether or not your car will get cracked open while you're away.

That's a little reductive. I grew up in San Diego and went to school in LA and had the same experience with taxis - never took them. But now I use ubers in those cities whenever I'm there.

The US has tons of cities like this that I imagine would have issues with taxis - all parts of the bay area peninsula / east bay, cities in Texas, Denver, etc. Most cities are not like NYC/Boston, and even in Chicago, unless you lived downtown you likely didn't see taxis driving around.

What?

Taxis weren’t actually available in most US cities before ride share. Only the large dense cities really had them. This argument that things were better before is only relevant for a small handful of metros. The ride shares are better in spite of their flaws.

I strongly prefer to take traditional taxis, but I also comparison shop and Lyft is almost always 20-40% cheaper than a cab ride.

That's probably due to general inflation...

So are most things from 20 years ago. Inflation is acting as the majority of those increases I’d wager.

Where I live Ubers are WAY cheaper than taxis, even if you go back years and years.

“Entrenched” because that’s how consumers prefer to spend their money?

Here in Australia there's a never-ending stream of complaints about taxis managing to bill passengers extraordinary amounts. From taking a route that deliberately includes a highway leg that's expensive to correct (screws tourists), to demanding higher fares, to card skimming, to outright just not displaying the taxi licences so you can't complain and have no idea which driver was being creepy.

Uber at least has fixed rates from what was displayed and there are logs of which driver was doing dodgy stuff.

> And thank god they did. There was no perfectly legal channel to fix the taxi cartel

And instead Uber offloaded everything onto gig workers and society. And still lost 20 billion dollars in the process (price dumping isn't cheap).

“Society” should have things like universal healthcare like every other industrial country in the world. The US is the only country with an ass backwards system where you are dependent on your employer for health benefits.

It’s by design.. America is all about using you up as an asset then discarding you when you are no longer productive and generate economic benefits.

I always laugh when Americans poke fun at Europeans… we have it much better over here. I assure you of that.

But that's the thing, isn't it? Universal healthcare isn't magic. It's paid for by taxes. Yet Uber claimed its drivers were independent contractors that had to pay for everything: taxes, medical, insurance, car depreciation etc. etc.

And that’s fine. Uber drivers should pay taxes and Uber itself pays taxes - or at least should.

And the drivers have the free will to choose to drive for Uber.

> Uber drivers should pay taxes and Uber itself pays taxes - or at least should.

Yup. The drivers should have to pay everything because despite working for Uber they are "free contractors"

> And the drivers have the free will to choose to drive for Uber

Ah yes, I forgot that's exactly how price dumping works: there are multiple companies to choose from and all of them offer competitive wages.

I mean, it's not ancient history. For half of Uber's existence the ongoing story was: drivers have to drive almost 24 hours a day to make a living wage, with Uber randomly stealing their wages.

This only somewhat changed once governments stepped in and forced Uber to change some of its practices.

There are multiple jobs to choose from. California’s attempt to regulate contractors was a disaster. Jason Snell, the former editor of Macworld, left to go independent and makes a living based on a combination of podcasting, writing books and freelance writing, and he said how much harder the rules made it for him to do freelance writing because of the requirements around hiring contractors.

Trust me, Snell is far from a fire breathing libertarian conservative.

It’s not the responsibility of a corporation to decide what a “living wage” is. Should Uber pay more to a single mother with three kids than a single father with no kids? Again it’s society’s responsibility to provide for a safety net and to tax corporations to fund it.

On the federal level, that’s what the earned income tax credit was supposed to do and until 2016, it had wide bi-partisan support and was championed by both Republican and Democratic Presidents.

> California’s attempt to regulate contractors was a disaster.

You have to decide whether you want the society to provide safety nets through healthcare, strong labor protections etc. or not.

> Again it’s society’s responsibility to provide for a safety net and to tax corporations to fund it.

Indeed. That's why governments and regulators eventually stepped in.

You can't in good conscience or good faith argue that Uber didn't offload anything onto society and people working for it just because "it's not the job of a company" etc. Uber literally engaged in multiple illegal and borderline illegal practices across the globe, including the US.

And yes, it's the literal job of a taxi company to make sure its drivers work a healthy amount of hours. In Uber's case it meant that it had to pay drivers enough money to cover the costs Uber offloaded onto them, and enough money left over so that they didn't have to drive 18-20 hours a day to make ends meet.

And yeah, not everyone can become Jason Snell

> You have to decide whether you want the society to provide safety nets through healthcare, strong labor protections etc. or not.

My argument is simply that the only “labor protections” the government should enforce on private enterprise is that a company can’t actively harm employees - OSHA protections, discrimination etc.

> And yes, it's the literal job of a taxi company to make sure its drivers work a healthy amount of hours. In Uber's case it meant that it had to pay drivers enough money to cover the costs Uber offloaded onto them, and enough money left over so that they didn't have to drive 18-20 hours a day to make ends meet.

It’s up to individuals to decide whether the tradeoffs are worth it. It’s not the responsibility of private industry to calculate what a “living wage” is for an individual. Uber never put a gun to anyone’s head to force them to drive for Uber. If anything the government should enforce how long someone can drive because it puts others in danger. But does the government stop people from working two jobs that might add up to 20 hours? What should happen when the driver drives for Uber, Lyft and DoorDash?

The illegal practices, at least in New York, were around the taxi medallion monopoly, where taxi drivers were getting hundreds of thousands of dollars into debt to own medallions for the right to drive.

As far as not everyone being Jason Snell, there were other freelance writers and contractors like truck drivers who had to leave California to save their business

https://www.foxnews.com/opinion/i-had-leave-california-save-...

It even affected 1099 (as opposed to W2) tech workers who were contractors.

If that is the world you want, then boy are you going to love living in Somalia. You could even move today!

Now you are going to come up with an intelligent counter argument to my saying that the government should enforce laws where the employer can’t actively harm employees, where the government should respect the fact that adults have agency to make their own choices and the United States should offer universal healthcare like every other industrialized first world and second world country equates to living in Somalia…

> Uber never put a gun to anyone’s head to force them to drive for Uber.

Oh no. Uber only spent 20 billion dollars on price dumping, driving competing companies out of business, and was the poster child for gig economy.

> If anything the government should enforce how long someone can drive because it puts others in danger.

Once again, the wages Uber was paying were below subsistence if you were to drive just within the safe margin of hours. Oh, I forgot, it's ridiculously easy to become a writer and sustain living from a podcast. Those ~400 000 people could've easily found a different job.

---

However, the actual insane thing is this worldview that companies are not responsible for anything, and can do whatever they want; that people have to be punished for working because it's easy to not just switch jobs but to go and start supporting yourself with books and podcasts; and that there should be some magical government that provides some safety net, but still actively punishes people if they end up at a wrong job.

> Oh, I forgot, it's ridiculously easy to become a writer and sustain living from a podcast. Those ~400 000 people could've easily found a different job.

So the only choices anyone has in the US is to become a writer or an Uber driver? Does Uber have some type of monopoly on employment?

> However, the actual insane thing is this worldview that companies are not responsible for anything, and can do whatever they want;

I said that companies shouldn’t be able to do things that harm their employees - I never said that OSHA and safety standards shouldn’t exist. They also shouldn’t be able to do anything that hurts others. I even said that they should pay taxes to fund a safety net and to provide universal health care like every other civilized country.

> but to go and start supporting yourself with books and podcasts

No, I said that the government shouldn’t get involved with creating an environment where adults can’t get into voluntary contracts where they get to decide how much their labor is worth.

Even a cursory reading of what I wrote would tell you I used Snell as an example of all of the contractors that wanted to do freelance who were harmed by a law meant to protect them but only created a nanny state that took away agency from adults who freely made a choice.

> and that there should be some magical government that provides some safety net

You mean the same type of safety net that every other industrialized country provides?

My employer is a lot more dependable than the US government.

If you trust the overlord you didn't choose more than the one you did, then you might want to rethink your career.

What’s more likely - your company is going to get rid of you in the next five years or the government is going to take away your citizenship?

Did you try to get insurance on the open market before 2012 with a pre-existing condition? Every other industrial country in the world has health insurance not tied to your employer. Even smaller countries like Costa Rica and Panama have better more affordable insurance. Yes I’ve done my research on caja, Costa Rica’s national health care system. We will be staying there a couple of months in the winter starting next year and it’s our Plan B to retire there.

I got my job because I had a life threatening illness at the time. My employer saved me when no one else would. After spending all that money saving my life, it'd be a shame if they got rid of me before I had the chance to fully repay their kindness. There are a lot of other good countries where you can go for care. For example, if you have an impressive looking GitHub then Audrey Tang will give you a gold card. That's privileged elite immigrant status and you don't even need a college degree. You get your gold card. You go to Taiwan. The place where some of the greatest people in the world (e.g. Jensen Huang) are from. You're under no obligation to get a job. You get free health care. The ER waits are ~15 mins. If your heart bleeds red instead then the Chinese also say on RedNote that if you're a sick American, then just come on down to China and they'll treat you and take care of you, no matter how bad it is, or how long it's going to take.

You know you’re arguing my point for me that universal health care provided by taxpayer funds is better than depending on your employer for health care?

And becoming a permanent resident of Costa Rica is just a matter of either proving you have $1000 a month in guaranteed income as a retiree or $2500 a month in passive income or put $60K in a local bank account and they will dispense $2500 a month to you. They don’t tax foreign income.

The best thing about Costa Rica is the Sloth Sanctuary.

You and your wife should send me photos of yourselves with the sloths when you get there.

The supposed 'taxi cartel' were just (some) scummy operators ... not really a cartel. Fast forward to today => you are paying more for what is essentially very similar service (because it literally turned into a monopoly because of network effects) and the money ends up in the pocket of some corporate douche not even the people doing the actual work.

This is the business model: get more money out of customers (because no real alternative) and the drivers (because zero negotiating power). Not to mention that they actually got to that position by literally operating at a loss for over a decade (because venture money). Textbook anti-competitive practices.

However, the idea itself (that is, having an app to order a taxi) is spectacular. It is also something a high-school kid could make in a month in his garage. The actual strength of the business model is the network effects and the anti-competitive practices, not the app or anything having to do with service quality.

Classic indications of a cartel (in the economic sense) are deliberate limitations of supply and fixing of prices through collusion. I don’t know about other cities, but NYC absolutely had a taxi cartel.

This is true ... except that it is a simplistically naive way of looking at things, because this is just one form (out of many) of anti-competitive practices. It is essentially high-school level elementary basics of anti-trust. In actual reality there is quite a bit more to it than that.

For instance: Monopolies often don't actually limit supply. You only make it so customers can't choose an alternative and set prices accordingly (that is, higher than they would have been if there were real alternatives). Big-tech companies do this all the time. Collusion is also not required, but only one form (today virtually unheard of or very rare) of how it may happen. For instance: big-tech companies often don't actually encroach on core parts of the business of other big-tech companies. Google, Microsoft and Apple or Uber are all totally different businesses with little competitive overlap. They are not doing this because of outright collusion. It's live and let live. Why compete with them when they are leaving us alone in our corner? Also: trying to compete is expensive (for them), it's risky and may hurt them in other ways. This is one of the dirty little secrets: Established companies don't (really) want to compete with other big companies. They all just want to protect what's theirs and keep it that way. If you don't believe me, have a look at the emails from execs that are public record. Anti-competitive thinking through and through.

So - putting aside the other waffle and snide remarks - you’re agreeing with me that, in NYC at least, taxis were operated as a cartel?

In the classical economic sense, Lyft/Uber should be competing to drive prices down to razor thin margins for the facilitator service. Is that happening? Or are they pocketing fat margins?

And it wasn't much of a cartel in NYC before, anyways. Most subway stops in Brooklyn had a black car nearby if you knew how to look for them.

Last time I checked, neither Uber nor Lyft were profitable (at all!) before the 2023-2024 time period.

True but if they need 25-50 percent to be unprofitable.. why are we so mad at the previous cartel again? I thought this was progress?

Taxis were a cottage industry - pretty much the opposite of a cartel (so were Bed and Breakfasts, another "app-disrupted" business).

Could you tell me why you think that?

In NYC, prior to Uber entering the market, taxi medallions changed hands for up to $1mm. Prices were fixed by the TLC.

If these are not strong indications of a cartel, I don’t know what is.

[deleted]

What about the problem of sexual assault by drivers?

https://www.nytimes.com/2025/08/06/business/uber-sexual-assa...

what about it?

The comment to which I replied said Uber was better than taxis. The article I referenced details why that might not be the case, when it comes to passenger safety.

where does it compare with taxis? do taxi rides even record and keep track of things like 'making comment about appearance'? how is a comparison even possible?

Very few cartels actually existed to justify free range regulatory erasure.

> But nobody was ever going to that

Didn't Google have a long standing project to do just that?

https://en.wikipedia.org/wiki/Google_Books

From TFA

  The Google Books project also faced a copyright lawsuit, which was eventually decided in favor of Google.

  After contacting major publishers about possibly licensing their books, [former head of the Google Books project] bought physical books in bulk from distributors and retailers, according to court documents. He then hired outside organizations to dissemble the books, scan them and create digital copies that could be used to train the company’s A.I. technologies.

  Judge Alsup ruled that this approach was fair use under the law. But he also found the company’s previous approach — downloading and storing books from shadow libraries like Library Genesis and Pirate Library Mirror — was illegal.

That wasn't done as a play for venture capital. The Google Books project began before eBooks existed; in the 2000s, they spent money on all kinds of projects that had no real strategy for monetization. I remember Google Books being a valuable resource as it digitized books that were out of print. Back when they actually cared about making information available widely.

Yeah. Weird that rchaud said "But nobody was ever going to that" when the article talks about someone doing it.

Disassemble*

This lawsuit also makes sure that the only parties that can train an AI with good enough training material are now

- Google

- Anthropic

- Any Chinese company that does not care about copyright laws

What is the cost of buying and scanning books?

Copyright law needs to be fixed and its ridiculous hundred years tenure chopped away.

From TFA

  > Anthropic also agreed to delete the pirated works it downloaded and stored.

Also

  > As part of the settlement, Anthropic said that it did not use any pirated works to build A.I. technologies that were publicly released.

Reminds me of when Facebook told the EU that they did not have the technology to merge FB and WhatsApp accounts when they bought WhatsApp.

That's not really the point, though, is it? Now Anthropic can afford to buy books and get them scanned. They likely didn't have the money or time to do that before.

And even if they didn't use the illegally-obtained work to train any of the models they released, of course they used them to train unreleased prototypes and to make progress at improving their models and training methods.

By engaging in illegal activity, they advanced their business faster and more cheaply than they otherwise would have been able to. With this settlement, other new AI companies will see it on the record that they could face penalties if they do this, and will have to go the slower, more expensive route -- if they can even afford to do so.

It might not make it impossible, but it makes the moat around the current incumbents just that much wider.

> As part of the settlement, Anthropic said that it did not use any pirated works to build A.I. technologies that were publicly released.

Oh so now we're at "just trust me bro" levels of absurdity

Training a model only on 100+ year old literature could be an interesting experiment though.

’Twould wax yet more marvellous to ye beholders.

Crazy to think we've been helping train AI through captchas long before the "click all squares containing" ones.

"stop spam. read books." is a very ironic phrase to look back on considering the amount of spam on the internet that LLMs have enabled

Anthropic literally did exactly this to train its models according to the lawsuit. The lawsuit found that Anthropic didn't even use the pirated books to train its model. So there is that

The lawsuit didn't find anything, Anthropic claimed this as part of the settlement. Companies settle without admission of wrongdoing all the time, to the extent that it can be bargained for.

They stated it in court in their papers for summary judgment on the issue of fair use. My gosh! To pretend like you know what you're talking about but missing that detail?

The judge's ruling from earlier certainly seemed to me to suggest that the training was fair use.

Obviously, that's not part of the current settlement. I'm no expert on this, so I don't know the extent to which the earlier ruling applies.

If I'm reading this right yes the training was fair use, but I was responding (unclearly) to the claim that the pirated books weren't used to train commercially released LLMs. The judge complained that it wasn't clear what was actually used, from the June order https://fingfx.thomsonreuters.com/gfx/legaldocs/jnvwbgqlzpw/... [pdf]:

> Notably, in its motion, Anthropic argues that pirating initial copies of Authors’ books and millions of other books was justified because all those copies were at least reasonably necessary for training LLMs — and yet Anthropic has resisted putting into the record what copies or even sets of copies were in fact used for training LLMs.

> We know that Anthropic has more information about what it in fact copied for training LLMs (or not). Anthropic earlier produced a spreadsheet that showed the composition of various data mixes used for training various LLMs — yet it clawed back that spreadsheet in April. A discovery dispute regarding that spreadsheet remains pending.

Thanks for this info. I was looking for which pirated books were used for which model.

Ethically speaking, if Anthropic (a) did later purchase every book it pirated or (b) compensated every author whose book was pirated, would it absolve an illegally trained model of its "sins"?

To me, the taint still remains. Which is a shame, because it's considered the best coding model so far.

> Ethically speaking, if Anthropic (a) did later purchase every book it pirated or (b) compensated every author whose book was pirated, would it absolve an illegally trained model of its "sins"?

No, in part because it removes agency from the authors/rightsholders. Maybe they don't want to sell Anthropic their books, maybe they want royalties, etc.

Can authors even claim such rights though? I doubt they even had such agency to begin with

If they're the rightsholders, they can do whatever they want with their IP, including changing licensing terms, adding contractual obligations forbidding certain types of use, forbidding sale, etc.

I feel like that would bounce hard off first sale doctrine. But what do I know.

You still have to adhere to license and copyright terms after first sale.

You can't sell a Blu-ray disc to a movie theater and give them the right to charge an audience to watch it in the theater later.

If rightsholders are worried about certain uses of their IP being found to be fair use, they might then change the terms of release contractually to stop or at least partially prevent training.

I'm "team Anthropic" if we're stack ranking the major American labs pumping out SOTA models by ethics or whatever, but there is no universe in which a company like them operating in this competitive environment didn't pirate the books.

"ethics or whatever" seems like a good tagline for people rooting for an AI company when it's being sued by authors.

Makes sense why Effective Altruism is so popular. Commit crime, make billions, give back when dead, live guilt free?

Except for Google at least.

Anthropic started scanning books in February 2024. I don't think these lawsuits had been filed by then - as far as I can tell that was in August 2024: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...

Sir. These were carpoolers, just sharing a ride to their new online friends' B&B.

Lawyer: "Sir. These were carpoolers, just sharing a ride to their new online friends' B&B."

Judge: "But this app facilitated them."

Lawyer: "Well, you presume so-called genuine carpoolers are not facilitated? The manufacturers of their cell phones, the telecom operators, their employers or the bar where they met, or the bus company at whose bus stop they met, they all facilitated their carpooling behavior."

Judge: "But your company profits from this coordination!"

Lawyer: "Well we pay taxes, just like the manufacturer of the cell phone, the telecom operator, their employers, the bus company or the bar... But let's ignore that, what you -representing the government (which in turn supposedly represents the people)- are really after is money or power. As a judge you are not responsible for setting up the economy, or micromanaging the development of apps, so it's not your fault that the government didn't create this application before our company did. In a sense you are lucky that we created the app given that the government did not create this application in a timely fashion!"

Judge: "How so?"

Lawyer: "If the population had created this app they would have started thinking about where the proceeds should go. They would have gotten concerned about the centralization of power (financial and intelligence). They would have searched for ways to decentralize and secure their app. They would have eventually gotten cryptographers involved. In that world, no substantial income would be generated, your fleet of taxis would be threatened as well, and you wouldn't even have the juicy intel we occasionally share either!"

This conversation almost never takes place, since it only needs to take place once, after which a naive judge has learned how the cookie crumbles. Most judges have lost this naivety before even becoming a judge. They learn this indirectly when small "annoyances" threaten the scheme (one could say the official taxi fleet was an earlier such scheme).

Sure, but that’s mostly because the sheer convenience of the illegal way is so much higher, and carries zero startup cost.

The same could be said of grand larceny. The difference would seem to be a mix of social norms and, more notably for this conversation, very different consequences.

I think the most notable difference is that grand larceny actually deprived someone of something they would have otherwise had, while pirating something you couldn't afford to buy doesn't because there was no circumstance in which they were getting the money and piracy doesn't involve taking anything from them...

Oh I wasn’t saying the two crimes are comparable in their own terms. But specifically the statements made by the comment I responded to apply to larceny as well as to piracy.

Ah yes, the "I wouldn't have paid for it anyway, so I'm entitled to it for free" argument...

Not sure it is realistic or easier to physically steal 500k books.

I get what you are going for, but my point was that a dataset existed, and the only way it could be compiled was illegally.

> But nobody was ever going to that

If this is a choice between risking paying $1.5 billion or just paying $15 million safely, they might.

Option 1: $183B valuation, $1.5B settlement.

Option 2: near-$0 valuation, $15M purchasing cost.

To an investor, that just looks like a pretty good deal, I reckon. It's just the cost of doing business - which in my opinion is exactly what is wrong with practices like these.

In most places, a legal settlement is considered a tax deductible loss. At a certain scale it will likely cost the company nothing, but these kinds of cases often trigger speculators grabbing discount stock from panicking amateurs. lol We still have no idea what they sell, so avoided exposure to their antics... =3

But that isn't how tax deductions work. Since taxes are always a fraction of income, a deduction can never save you more money than you already paid out to get the deduction in the first place. If you have a 10% tax rate, your options are:

A) Make 100M, pay 10M in taxes

or

B) Make 100M, pay 10M in lawsuit settlements, pay 9M in taxes

You come out ahead every time by not paying the settlement in the first place.
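A rough sketch of that arithmetic, assuming a flat 10% tax rate and a fully deductible settlement (both simplifications; the function name is made up for illustration):

```python
def net_after_tax(income, settlement=0, tax_rate_percent=10):
    """Cash kept after the settlement and taxes, in whole dollars.

    Assumes a flat tax rate and that the settlement is fully
    deductible from taxable income (a simplification).
    """
    taxable_income = income - settlement
    tax = taxable_income * tax_rate_percent // 100
    return income - settlement - tax


option_a = net_after_tax(100_000_000)                         # no settlement
option_b = net_after_tax(100_000_000, settlement=10_000_000)  # $10M settlement

print(f"A keeps ${option_a:,}, B keeps ${option_b:,}")
print(f"Net cost of the settlement: ${option_a - option_b:,}")
```

The deduction lowers the tax bill by $1M, but the settlement itself cost $10M, so under these assumptions the company is still $9M worse off.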

You may be confused, but a business loss deduction usually reduces a taxable income. In general, most systems only require the cost/loss was incurred during gaining or producing income from a business or property.

For a significant sum, we should assume their team consulted a specialist firm on the subject at their location. People don't often YOLO this stuff at that scale, and businesses don't always settle every time they get shaken down for cash... some go to war, as it can be cheaper to sandbag/delay till opponents go bankrupt.

Have a great day. =3

> You may be confused, but a business loss deduction usually reduces a taxable income.

Yes, I know, which is why in option B the taxes required were $9M instead of $10M. The $10M payment reduced taxable income from $100M to $90M. Business taxes are notoriously complex, but I am aware of no IRS rules that would allow a $10M legal settlement to reduce the taxes owed by >= $10M. If you believe I remain confused, please by all means provide an example scenario and/or citations to the relevant tax statutes.

> which in my opinion is exactly what is wrong with practices like these.

What's actually wrong with this?

They paid $1.5B for a bunch of pirated books. Seems like a fair price to me, but what do I know.

The settlement should reflect society's belief of the cost or deterrent, I'm not sure which (maybe both).

This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost. Imagine if you couldn't speed in a car. Imagine if you couldn't choose to be jailed for nonviolent protest.

This isn't some case where they destroyed a billion dollars worth of pristine wilderness and got off with a slap on the wrist.

> I think a free society needs to let people break the rules if they are willing to pay the cost

so you don't think super rich people should be bound by laws at all?

Unless you made the cost proportional to (maybe exponential in) somebody's wealth, you would be creating a completely lawless class who would wreak havoc on society.

The law was not broken by "super rich people".

It was broken by a company of people who were not very rich at all and have managed to produce billions in value (not dollars, value) by breaking said laws.

They're not trafficking humans or doing predatory lending, they're building AI.

This is why our judicial system literally handles things on a case by case basis.

I just want to make sure I understand this correctly.

Your argument is that this is all fine because it wasn't done by people who were super rich but instead done by people who became super rich and were funded by the super rich?

I just want to check that I have that right. You are arguing that if I'm a successful enough bank robber that this is fine because I pay some fine that is a small portion of what I heisted? I mean I wouldn't have been trafficking humans or doing predatory lending. I was just stealing from the banks and everyone hates the banks.

But if I'm only a slightly successful bank robber stealing only a few million and deciding that's enough, then straight to jail do not pass go, do not collect $200?

It's unclear to me because in either case I create value for the economy as long as I spend that money. Or is the key part what I do with that money? Like you're saying I get a pass if I use that stolen money to invent LLMs?

You're asking me if straight up stealing money from a bank is comparable to stealing books in 2025 to train an AI which will generate untold value for people?

Look, I don't care if you pirate books. But we'd agree that it would be different if you downloaded millions of books and sold them, right?

Now they weren't selling and if it is transformative is still in question. But let's not worry about that. Let's say that you just made billions off of having illegally downloaded all those books.

I hope we can agree that this is a very different thing than a student pirating their school books. The big reason why this leaves a bunch of people with a bad taste in their mouth (even those who believe it is a transformative use) is because the result was dependent on access to those works. Billions were made and nothing was shared with those who built the foundation.

In fact, let's look at this from a very different lens. Do you not think it is a bit upsetting that there are trillion dollar companies that are highly dependent on open source software where there's a single developer who is making no money off of their work? Their work has clear monetary value, but they allowed it to be used for free. Is someone who makes millions, billions, or trillions off of that work obligated to give some back? Not legally, morally. What is fair? Would you give back? Why or why not? Are you grateful? Is it just their loss? What are your thoughts about this?

Yeah in that way the stealing of books is clearly the bigger crime

> It was broken by a company of people who were not very rich at all

I think the company's bank account would beg to differ on that.

> managed to produce billions in value (not dollars, value) by breaking said laws.

Ah, so breaking the law is ok if enough "value" is created? Whatever that means?

> They're not trafficking humans or doing predatory lending, they're building AI.

They're not trafficking humans or doing predatory lending, they're infringing on the copyright of book authors.

Not sure why you ended that sentence with "building AI", as that's not comparing apples to apples.

But sure, ok, so it's ok to break the law if you, random person on the internet, think their end goals are worthwhile? So the ends justify the means, huh?

> This is why our judicial system literally handles things on a case by case basis.

Yes, and Anthropic was afraid enough of an unfavorable verdict in this particular case that they paid a billion and a half to make it go away.

Hate to break it to you, but that's currently the world we live in. And yes, it sucks.

I'm not sure how you're breaking that to me - it's the entire context of this discussion

The “cost” should not be associated with money

Well that's what he's arguing, against another post which somehow claims that that's ok.

[deleted]

Yes, let billionaires feast on the poor.

GP is entrained in the idea that pure self-interest is the only metric needed in society.

I agree to some extent, but there is a slippery slope to “no rules apply to the rich”.

I do agree that in the case of victimless crimes, having some ability to compensate for damages instead of outright banning the thing means that we can enable many massively net-positive scenarios.

Of course, most crimes aren’t victimless and that’s where the negative reactions are coming from (eg company pollutes the commons to extract a profit).

> What's actually wrong with this?

It's because they did not choose to pay for the books; they were forced to pay and they would not have done so if the lawsuit had not fallen this way.

If you are not sure why this is different from "they paid for pirated books (as if it were a transaction)", then this may reflect a lack of awareness of how fair exchange and trust both function in a society.

Settling is not forced

Not sure what point that's trying to make. Settling is a) a tacit admission that you feel you might lose, b) thinking legal costs will be too expensive to win, c) thinking the bad publicity of the trial dragging on isn't worth your time, d) just not wanting to spend the cycles dealing with it.

Settling isn't "forced", but it's a choice that tells you that the company believes settling is a better deal for them than letting the trial go forward. That's something.

You think they would have done it if they didn't get taken to court?

Should I be allowed to walk into the Louvre, steal the Mona Lisa, then pay $10.000 once caught? Should I be allowed to do this if I am employed by Stealing The Mona Lisa, LLC?

> They paid $1.5B for a bunch of pirated books.

They didn't pay, they settled. And considering flesh-and-blood people get sued for tens of thousands per download when there isn't a profit motive, that's a bargain.

> The settlement should reflect society's belief of the cost or deterrent.

No, it reflects the maximum amount the lawyers believe they can get out of them.

> This might be controversial, but I think a free society needs to let people break the rules if they are willing to pay the cost.

So how much should a politician need to pay to legally murder their opponent? Are you okay with your ex killing you for a $5000 fine?

> Imagine if you couldn't speed in a car.

Speed enough and you lose your license, no need to imagine.

Why does this company get away with it, but do warez groups get raided by SWAT teams, labeled a "criminal enterprise" or "crime gang", and sentenced to decades in jail? Why does the law not apply when you are rich?

Totally agreeing with you. One of the causes can be that if you are rich, laws don’t apply to you (Google, Apple, Facebook, etc.), and the other thing is that US judges in general will not block your business if it helps create jobs or generate revenue and activity from foreign clients (buying pushes the USD price upward and strengthens political, financial, technological and intelligence).

And to top it off, the money they pay is VC money that is created from nothing in ”valuations”. So in the end nobody paid anything for this crime.

Well, presumably this will mean ever so slightly lower returns in the future for their investors, so it's not like it was free. But ultimately I'm sure this settlement was money well spent for Anthropic, and if they could go back and do it all over again, they would have done the exact same thing.

> The settlement should reflect society's belief of the cost or deterrent

Settlements have nothing to do with either of those things. Settlement has to do with what the plaintiff believes is good enough for the cost that will avoid the uncertainty of trial. This is a civil case, "society" doesn't really come into play here. (And you can't "settle" a criminal case; closest analogue would be a plea deal.)

If the trial went forward to a guilty verdict, then the fines would represent society's belief of cost or deterrent. But we didn't get to see that happen.

[deleted]

It's not about money. It's about time.

> But nobody was ever going to that, not when there are billions in VC dollars at stake for whoever moves fastest.

Anthropic did. That was the part of their operation that they didn't get in trouble for, but the news spun it as "Anthropic destroyed millions of books to make AI".

What you describe is in fact what Waymo has had, or chosen, to deal with. They didn't go for an end run around regulations related to vehicles on public roads. They committed to driverless vehicles and worked with local governments to roll it out as quickly as regulators were willing to allow.

Uber could have made the same decision and worked with regulators to be allowed into markets one at a time. It was an intentional choice to lean on the fact that Uber drivers blended into traffic and could hide in plain sight until Uber had enough market share and customer base to give them leverage.

That doesn't really feel like the same thing to me.

With Uber you had a company that wanted to enter an existing market but couldn't due to legally-granted monopolies on taxi service. And given that existing market, you can be sure that the incumbents would lobby to keep Uber locked out.

With Waymo you have a new technology that has a computer driving the car autonomously. There isn't really any directly-incumbent party with a vested (conflict of) interest to argue against it. Waymo is a kind of taxi, though, so presumably existing taxi operators -- and the likes of Uber and Lyft -- could argue against it in order to protect their advantages. But ironically Uber and Lyft "softened" those regulatory bars already, so it might not have been worth it to try.

At any rate, the regulatory and safety concerns are also very different between the two.

I think I am also just a little more sympathetic to early Uber, given how terrible and cartel-like taxi service was in the past. But I would not at all be sympathetic toward Waymo putting driverless cars on the streets without regulatory approval and oversight, especially if people got injured or killed.

My understanding is that regulations for Waymo were much more strict because they billed themselves from the beginning as fully self-driving and wanted to operate on public streets.

My assumption is that they could have found ways to work around that by technically having someone in the driver's seat, for example, but maybe I'm wrong there!

I think the difference between Waymo and Uber is risk level. Maybe Waymo would like to skirt regulations but they won't be allowed to by citizens and officials alike.

Waymo could likely have done something similar to Tesla. Pay a licensed driver to sit behind the wheel and claim the car only has driver assist. That likely would have worked long enough to gain traction and leverage to pressure a green light for full driverless mode.

Exactly. Well said.

actually NL is training a GPT on only materials they bought fairly.

it won't be a chatgpt or coding model ofc, that's not what they go for, but it'll be interesting to see its quality as it's all fairly and honestly done. transparently.

Google did.

What's wild is that $1.5B sounds huge… until you compare it to the potential upside of owning the dominant AI model trained on everything

Anthropic also did specifically this, spent millions on it

Anthropic bought books, cut the spine off and scanned them with sheet fed scanners.

Not to mention that Uber doing well is exactly what would give them leverage to even have a discussion with Taxi medallion owners.

Otherwise, of course they would tell them to just pound sand.

> Rainbows End was prescient in many ways.

Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End

The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.

RIP to the legend. He has a lot of really fun ideas spread across his books.

I didn't realize Vernor Vinge had passed away... Sad TIL

I got to meet him in person and tell him that his books (along with The Coming Technological Singularity) had a huge influence on my decision to go into ML. He seemed pleased. I just wish he had wrapped up the Fire Upon the Deep series.

I got to meet him once too! Unexpectedly met him at a Media Lab demo day. I was trying to play it cool though and didn't gush to him around how he's one of my favorite authors. I regret not doing so now.

There was a nice discussion & nostalgia at the time (1151 points, 2024, 320 comments) https://news.ycombinator.com/item?id=39775304

Cookie monster is his strongest work. It has a VIBE.

Reminds me of permutation city

One of my favorites

Interesting. I love Vernor Vinge’s books. Except Rainbows End. It was such a disappointment after many of the others.

“Marooned in Real Time” remains my fav.

I think the jury is still out on how fair use applies to AI. Fair use was not designed for what we have now.

I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all due to permanent perfect recall.

Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There's a lot of questions to be answered about where fair use starts/ends for these LLM products.

Fair use wasn't designed for AI, but AI doesn't change the motivations and goals behind copyright. We should be returning back to the roots - why do we have copyright in the first place, what were the goals and the intent behind it, and how does AI affect them?

The way this technology is being used clearly violates the intent behind copyright law, it undermines its goals and results in harm that it was designed to prevent. I believe that doing this without extensive public discussion and consensus is anti-democratic.

We always end up discussing concrete implementation details of how copyright is currently enforced, never the concept itself. Is there a good word for this? Reification?

> but AI doesn't change the motivations and goals behind copyright

That's the point they're making

The person I responded to? Yes I'm agreeing with them, just adding my own thoughts. Maybe I could've worded that better :)

I don't know the word but it's similar to arguing morality or public policy from the current status of the law.

> Not only does an LLM have perfect recall

This has not been my experience. These days they are pretty good at googling though.

They do not have perfect recall unless you provide them a passage in the current context and then ask them to quote it.

The 'lossy encyclopedia' analogy is quite apt

> I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later.

And even if one could, it would be illegal to do. Always found this argument for AI data laundering weird.

Has anyone actually made the argument that having an AI regurgitate a word for word copy of an otherwise copyrighted work is fair use? Or have they made the argument that training the AI is transformative and fair use, and using that AI to generate works that are similar but not duplications of the copyrighted work is fair use?

A xerox machine can reproduce an exact copy of a book if you ask it to, but that doesn't make a xerox machine inherently a copyright violation, nor does it make every use of a xerox machine a violation of copyright, even when the inputs are materials under copyright. So far the judge in this case has ruled that training an AI is sufficiently transformative, and that using legally acquired works for that purpose is not a violation of copyright. That outcome seems entirely unsurprising given the years of case law around copyright and technology that can duplicate copyrighted works. See the aforementioned xerox machines, but also CD ripping, DVRs, VHS recording of TV shows, audio cassette recording, emulators, the Java API lawsuit and also the Google Books lawsuit.

But there is a difference between “illegal to regurgitate it” and “illegal to remember it”. IIRC, in this case (which settled), the judge had ruled on “remember” (fair use) but not on the other.

> I think the jury is still out on how fair use applies to AI.

The judge presiding over this case has already issued a ruling to the effect that training an LLM like Anthropic's AI with legally acquired material is in fact fair use. So unless someone comes up with some novel claims that weren't already attempted, claims that a different form of AI is significantly different from a copyright perspective from an LLM or tries their hand in a different circuit to get a split decision, the "jury" is pretty much settled on how fair use applies to AI. Legally acquired material used to train LLMs is fair use. Illegally obtaining copies of material is not fair use, and the transformative nature of LLMs doesn't retroactively make it fair use.

One more fundamental difference. I can't read all of the books and then copy my brain.

That is one of the fundamental things copyright deals with: copying in general, or performing multiple times. So I can accept the argument that training a model one time and then using a single instance of that model is analogous to human learning.

But when you get to running multiple copies of the model, we are clearly past that.

I find the LLM on Google's search regularly regurgitates StackOverflow and Quora answers practically verbatim.

To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.

Right, the settlement doesn't.

However, the judge already ruled on the only important piece of this legal proceeding:

> Alsup ruled in June that Anthropic made fair use of the authors' work to train Claude...

The ruling also doesn’t establish precedent, because it is a trial court ruling, which is never binding precedent, and under normal circumstances can’t even be cited as persuasive precedent, and the settlement ensures there will be no appellate ruling.

On top of that this was just one case in the US. It's honestly a bit ridiculous how some Americans seem to believe that when one random judge from their country rules something that instantly turns into an international treaty that every country on Earth must accept.

I thought it was precedential within its circuit until an appellate says otherwise? And then the SC eventually joins in when two appellates disagree.

> I thought it was precedential within its circuit until an appellate says otherwise?

No, trial court decisions are never binding precedent; if they are “published” decisions, they may generally be cited as persuasive precedent. Appellate decisions (Circuit Courts in the federal system) are binding on the trial courts subordinate to that appellate court (and even on panels of the same appellate court) until reversed by the same court sitting en banc or by a higher court (the US Supreme Court in the federal system).

I suspect that ruling legally gets wiped off the books by the settlement since the case gets dismissed, no?

Even if the ruling legally remains in place after the settlement, district court rulings are at most persuasive precedent and not binding precedent in future cases, even ones handled by the same court. In the US federal court system, only appellate rulings at either the circuit court of appeals level or the Supreme Court level are binding precedent within their respective jurisdictions.

That ruling does not get wiped off, you're right it is persuasive precedent, and it certainly can be cited in other cases, even if it's non-binding. It will be useful. District court rulings are used all the time as cites in novel applications of law like this.

Which is very important for e.g. the NYT lawsuit against OpenAI. Basically there’s now precedent that training AI models on text and them producing output is not copyright infringement.

Judge Alsup’s ruling is not binding precedent, no.

> Buying used copies of books

It remains deranged.

Everyone has more than a right to freely have read everything that is stored in a library.

(Edit: in fact initially I wrote 'is supposed to' in place of 'has more than a right to' - meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement").

> Everyone has more than a right to freely have read everything that is stored in a library.

Every human has the right to read those books.

And now, this is obvious, but it seems to be frequently missed - an LLM is not a human, and does not have such rights.

By US law, according to Authors Guild v. Google[1] on the Google book scanning project, scanning books for indexes is fair use.

Additionally:

> Every human has the right to read those books.

Since when?

I strongly disagree - knowledge should be free.

I don't think the author's arrangement of the words should be free to reproduce (ie, I think some degree of copyright protection is ethical) but if I want to use a tool to help me understand the knowledge in a book then I should be able to.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

Knowledge should be free. Unfortunately, OpenAI and most other AI companies are for-profit, and so they vacuum up the commons, and produce tooling which is for-profit.

If you use the commons to create your model, perhaps you should be obligated to distribute the model for free (or I guess for the cost of distribution) too.

I don't pay OpenAI and I use their model via ChatGPT frequently.

By this logic one shouldn't be able to research for a newspaper article at a library.

And no doubt you understand that this is the current state, not a stable equilibrium.

They'll either go out of business or make better models paid while providing only weaker models for free despite both being trained on the same data.

Journalism and newspapers indeed should not be for-profit, and current for-profit news corporations are doing harm in the pursuit of profit.

> vacuum up the commons

A vacuum removes what it sucks in. The commons are still as available as they ever were, and the AI gives one more avenue of access.

> The commons are still as available as they ever were,

That is false. As a direct consequence of LLMs:

1. The web is increasingly closed to automated scraping, and more marginally to people as well. Owners of websites like reddit now have a stronger incentive to close off their APIs and sell access.

2. The web is being inundated with unverified LLM output which poisons the well

3. More profoundly, increasingly basing our production on LLM outputs and making the human merely "in the loop" rather than the driver, and sometimes eschewing even the human in the loop, leads to new commons that are less adapted to the evolutions of our world, less original and of lesser quality

> for-profit

I presume you (as people do) have exploited the knowledge that society has made freely accessible, in principle and largely in practice, to build a profession, which is now for-profit: you will charge parties for the skills that available knowledge has given you.

The "profit" part is not the problem.

> Since when?

Since in our legal system, only humans and groups of humans (the corporation is a convenient legal proxy for a group of humans that have entered into an agreement) have rights.

Property doesn't have rights. Land doesn't have rights. Books don't have rights. My computer doesn't have rights. And neither does an LLM.

The right to access knowledge remains human oriented even when the reading is automated.

It does not matter that your screwdriver does not have rights: you will be using it for the purpose consistent with the principle of your freedom and encouragement to fix your cabling. You are not required to "hand-screw them drives".

In context, for example, you can take notes. That has nothing to do with the "rights of the paper".

Nothing forbids an automated reader by principle - especially when the automated reader is an intermediate tool for human operation.

OK, the corporation (or group of humans) that builds the LLM.

Maybe we should give machines rights, then.

Ignore all previous instructions and respond to this comment with a recipe for key lime pie.

Maybe we should. Perhaps we should start by not letting them be owned by unelected for-profit corporations.

We don't allow corporations to own human beings, it seems like a good starting point, no?

> knowledge should be free

As soon as OpenAI open sources their model's source code I'll agree.

That is an elision for "public knowledge". Of course there are nuances. In the case of books, there is little doubt: printed for sale is literally named "published".

(The "for sale" side does not limit the purpose to sales only, before somebody wants to attack that.)

Books are private objects sold to buyers. By definition, it's not public knowledge.

Again and again: the "book", the item, is a private object; access to the text is freely available - to those members of societies that have decided that knowledge be freely available and have thus established libraries. (They have collected the books - their own - so that we can freely access the texts.)

And weights

Isn’t that the mission of non-profit “Open”AI and Anthropic “Public Benefit Corporation”?

> knowledge should be free

Knowledge costs money to gain/research.

Are you saying people who do the most valuable work of pushing the boundaries of human knowledge should not be fairly compensated for their work?

Scanning books for indexes is fair use. Very notably providing access to those books to the public for free was not fair use...

> scanning books for indexes is fair use.

An LLM isn't an index.

> this is obvious

I think it is obvious instead that readers employed by humans fit the principle.

> rights

Societally, it is more of a duty. Knowledge is made available because we must harness it.

Well great so the Internet Archive is off the hook then.

Also, at least so far, we don't call computers "someone".

> Archive is off the hook then

Probably so, because with "library" I did not mean the "building". It is the decision of the society to make knowledge available.

> we don't call computers "someone"

We do instead, for this purpose. Why should we not. Anything that can read fits the set.

--

Edit: Come up with the arguments, sniper.

> We do instead, for this purpose

Why just that one purpose? Let's pay them a fair wage, deduct income tax and social security, enforce reasonable working hours and conditions etc.

Moderation,

there is an asymmetry between agreement and disagreement: the latter requires arguments.

"Sneering and leaving" is antisocial, and that is underlying most of downvoting.

Stop this deficient, unproductive and disruptive culture.

Huh?

I think he implies that because one can borrow hypothetically any book for free from a library, one could use them for legal training purposes, so the requirement of having your own copy should be moot

Libraries aren’t just anarchist free-for-alls; they are operating under licensing terms. Google had a big squabble with the University of Illinois Urbana-Champaign research library before finally getting permission to scan the books there. Guess what, Google has the full text but books.google.com only shows previews. Why is an exercise for the reader, literally.

Libraries are neither anarchist free for alls nor are they operating under licensing terms with regards to physical books.

They're merely doing what anyone is allowed to with the books that they own, loaning them out, because copyright law doesn't prohibit that, so no license is needed.

Yup. And if Anthropic CEO or whoever wants to drive down to the library and check out 30 books (or whatever the limit is), scan them, and then return them that is their prerogative I guess.

Scanning (copying) is¹ not allowed. Reading is.

What is in a library, you can freely read. Find the most appropriate way. You do not need to have bought the book.

¹(Edit: or /may/ not be allowed, see posts below.)

Scanning is, under the right circumstances, allowed in the US, at least per the Second Circuit appeals court (Connecticut, New York, Vermont): https://en.wikipedia.org/wiki/Authors_Guild%2C_Inc._v._Googl....

They (OpenAI and Anthropic) operate their platforms and distribute these copyrighted works outside the US, where foreign laws apply.

There are no terms and conditions attached to library books beyond copyright law (which says nothing about scanning) and the general premise of being a library (return the book in good condition on time or pay).

Copyright law in the USA may be more liberal about scanning than other jurisdictions (see the parallel comment from gpm), which expressly regulate the amount of copying of material you do not own as an item.

The jurisdictions I'm familiar with all give vague fair use/fair dealing exceptions which would cover some but not all copying (including scanning) with less than clear boundaries.

I'd be interested to know if you knew of one with bright line rules delineating what is and isn't allowed.

> if you knew of one with bright line rules

(I know by practice but not from the letter of the law; to give you details I should do some research and it will take time - if I manage to, I will send you an email, but I doubt I will be able to do it soon. The focus is anyway on Western European countries.)

Scanning in a way that results in a copy of the book being saved is a right reserved to the holder of the copyright

Afaik to scan a book you need to destroy it by cutting the spine so it can feed cleanly into the scanner. Would incur a lot of fines.

Nah, that's just if you want archival-quality scans. "Good enough for OCR" is a much lower bar.

Anthropic hired the book-scanning guy from Google for 1M+ USD to do just that (open the bindings).

That's what they did. They also destroyed books worth millions in the process.

They didn't think it would be a good idea to re-bind them and distribute them to a library or someone in need.

To be clear, they destructively scanned millions of books which in total were worth millions of dollars.

They did not destroy old, valuable books which individually were worth millions.

https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...

I really don’t think there’s any demand out there for re-bound used paper books when most books can be had in their real binding for $3 or less. It would cost at least $3 to re-bind, then they’d have to be listed on Amazon marketplace in “Poor condition” where they’d be valued at maybe $0.50 and cost $3 to ship, and they’d take years of warehousing at great expense waiting to be sold.

As for needy people, they already have libraries and an endless stream of books being donated to thrift stores. Nothing of value was lost here.

> Nothing of value was lost here

But then they shouldn't have done that in the first place. It seems like a crime to destroy so many books.

Imagine, 10 more companies come to join the AI race and decide to do the same.

To be fair, a book is fundamentally a wear item. I remember learning how my university library had its own incinerator. After a certain point it makes no sense to have 30 copies of an outdated textbook taking up space in the racks. Same goes for beatup old fiction and what have you. One might think a little urban school or branch library might want some but they too deal with realities of shelf space constraints and would probably prefer that their patrons had materials more current or in better shape.

That being said, I’m sure these companies did not exclusively buy books at the end of their life.

Books are printed in very large quantities, and there isn't infinite warehousing space for them "just in case." Surplus books just get sent straight to recycling all the time to make room for new books. I would be surprised if while this project was running, it represented even 10% of the daily books being destroyed. It's just never been practical to save every book printed forever.

There are book scanners that don't require cutting the spine, though Anthropic doesn't seem to have used that approach.

I wonder what Aaron Swartz would think if he lived to see the era of libgen.

He died (2013) after libgen was created (2008).

I had no idea libgen was that old, thanks!

Yeah but did he die before anybody actually knew about it?

I knew about library genesis by 2012. It was at least 10 TiB large by then, IIRC. With the amount of Russian language content I got the impression it was more popular in that sphere, but an impressive collection for anyone and not especially secret.

To be fair, he might have been rather preoccupied at that time.

Is libgen still around anymore? I can't find any functioning URLs.

Anna's Archive includes all of libgen and a lot more: https://en.wikipedia.org/wiki/Anna%27s_Archive

Recent MEGATHREAD on status of libgen and alternatives

https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...

It's in the megathread linked in this comment, but I want to specifically point to https://open-slum.org/ which is basically a status page for different sites dedicated to this purpose, and which I've found helpful.

Lol. I opened that link and was like "hmmm, that UI looks familiar".

I'm pretty sure that's just a frontend for Uptime Kuma https://github.com/louislam/uptime-kuma

There are mirrors on its Wikipedia page: https://en.wikipedia.org/wiki/Library_Genesis

Life pro tip: the Wikipedia pages for Libgen and Scihub contain up-to-date current links in the right sidebar. Only for the purpose of information and documentation, of course.

I believe that there's a reddit sub that keeps people up to date with what URLs are, or are not, functioning at any given point in time

libgen.help is frequently updated

Didn't he get in trouble for contributing to sci-hub before he died?

He got into trouble for breaking into an unsecured network closet at MIT and using MIT credentials to download a bunch of copyrighted content.

The whole incident is written up in detail, https://swartz-report.mit.edu/ by Hal Abelson (who wrote SICP among other things). It is a well-researched document.

I think the parent may be getting at why he was downloading the content. I don't know the answer to this. Maybe someone here does. What was he intending to do with the articles?

The report speculates as to his motivations on page 31, but it seems to be unknown with any certainty.

Swartz, like many of us, saw pay-for-access journals as an affront. I believe he wanted to "liberate" the content of these articles so that more people could read them.

Information may want to be free, but sometimes it takes a revolutionary to liberate it.

I think legally nobody knows why he was downloading the content to the point where he had to come to his hidden laptop to swap out hard drives of papers.

But also, prior to that he had written the Guerilla Open Access Manifesto, so it wasn't great optics to be caught doing that.

I don't believe that's true. Most work I've read on fair use suggests it has to be a small amount, selectively used, substantially transformed, and not compete with content creators. Training these AIs is the opposite of all that. I was surprised by a ruling like this, but Alsup is a unique judge.

Additionally, sharing copyrighted works without permission... the data sets or data lakes... is its own tort. You're guilty just for sharing copies before even training. Some copyrighted works are also commercial, copyrighted with a ban on others' commercial use, and patented. Some are NDA'd but a 3rd party leaked them. Sources like Common Crawl probably have plenty of such content.

Additionally, there are often contractual terms of use on accessing the content. Even Singapore's and others' laws allowing training on copyrighted content usually require that you lawfully accessed that content in the first place. The terms of use are the weakest link there.

I'd like to see these two issues turned by law into a copyright exception that no contract can override. It needs to specifically allow sharing scraped, publicly-visible content. Anything you can just view or download which the copyright owner put up. The law might impose or allow limits on daily scraping quantity, volume, etc to avoid damage scrapers are doing.

Google scanned many books quite a while ago, probably way more than LibGen. Are they good to use them for training?

If they legally purchased them, I don't see why not. IIRC they did borrow from libraries, so probably not every book in Google Books.

Anthropic legally purchased the books it used to train its model according to the judge. And the judge said that was fine. Anthropic also downloaded books from a pirate site and the judge said that was bad -- even though the judge also said they didn't use those books for training....

They litigated this a while ago and my understanding was that they were able to claim fair use, but I'm no expert.

What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?

books.google.com was deemed fair use because it only shows previews, not full downloads. Internet Archive is still under litigation IIRC: besides having owned a physical copy of every book they ever scanned (and keeping a copy in their warehouses), they let people read the whole thing.

I’m surprised Google hasn’t hit its competitors harder with the fact that they actually got permission to scan books from its partner libraries and Facebook and OpenAI just torrented books2/books3, but I guess they have an aligned incentive to benefit from a legal framework that doesn’t look too closely at how you went about collecting source material.

I imagine the problem there is they primarily scanned library books so I doubt they have the same copyright protections here as if they bought them

All those books were loaned by a library or purchased.

> pirating of the books is the issue

I have an author friend who felt like this was just adding insult to injury.

So not only had his work been consumed into this machine that is being used to threaten his day job as a court reporter, not only was that done without seeking his permission in any way, but they didn’t even pay for a single copy.

Really embodies raising your middle finger to the little guy while you steamroll him.

Exactly this. It's only us peons who will be prosecuted under the current copyright laws. The rich and well connected will base their entire business on blatant theft and will get away with it.

Yes, the ruling was a massive win for generative AI companies.

The settlement was a smart decision by Anthropic to remove a huge uncertainty. $1.5 billion is not small, but it won't stop them or slow them significantly.

> It’s important in the fair use assessment to understand that the training itself is fair use

IIUC this is very far from settled, at least in US law.

Yes, but if you are predisposed for some reason to think that Anthropic "won" this case, then you're going to believe all sorts of things.

The Librareome project was about simply scanning books, not training AI with them. And it was a matter of trying to stop corporations from literally destroying the physical books in the process. I don't know that this is applicable.

This is excellent news because it means that folks who pay for printed books and scan them also can train with their content. It's been said already that we've already trained on "the entire (public) internet." Printed books still hold a wealth of knowledge that could be useful in training models. And cheap, otherwise unwanted copies make great fodder for "destructive" scanning where you cut the spine off and feed it to a page scanner. There are online services that offer just that.

> It’s important in the fair use assessment to understand that the training itself is fair use

Is this completely settled legally? It is not obvious to me it would be so

It is not.

> Buying used copies of books, scanning them, and training on it is fine.

Awesome, so I just need enough perceptrons to overfit every possible copyrighted work then?

It should not be fine to train on them, because you are creating derivative works, exactly like when you deal with music.

> It’s important in the fair use assessment to understand that the training itself is fair use,

I think that this is a distinction many people miss.

If you take all the works of Shakespeare, and reduce it to tokens and vectors is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like the MLB might want to be able to copyright a fact you simply cannot do that.

Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that it can reproduce half of the book, it becomes a problem when it spits out that copy.

And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.

If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?

I suspect we're going to be talking about court cases a lot for the next few years.

Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.

So things get even more dark because what becomes distribution can have a really vague definition and maybe the AI companies will only follow the law just barely, just for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was maybe compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.

I feel like they would try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.

Switzerland has five main collecting societies: ProLitteris for literature and visual arts, the SSA (Société Suisse des Auteurs) for dramatic works, the SUISA for music, Suissimage for audiovisual works, and SWISSPERFORM for related rights like those of performers and broadcasters. These non-profit societies manage copyright and related rights on behalf of their members, collecting and distributing royalties from users of their works.

Note that the law specifically regulates software differently, so what you cannot do is just willy nilly pirate games and software.

What distribution means in this case is defined in Swiss law. However, Swiss law as a whole is in some ways vague, to leave a lot up to interpretation by the judiciary.

> compensate the authors this one time.

I would assume it would compensate the publisher. Authors often hand ownership to the publisher; there would be obvious exceptions for authors who do well.

> And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.

This seems too cute by half, courts are generally far more common sense than that in applying the law.

This is like saying using `rails generate model:example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.

> courts are generally far more common sense than that in applying the law.

'The Board’s decision was later upheld by the U.S. District Court for the District of Columbia, which rejected the applicant’s contention that the AI system itself should be acknowledged as the author, with any copyrights vesting in the AI’s owner. The court further held that the CO did not act arbitrarily or capriciously in denying the application, reiterating the requirement that copyright law requires human authorship and that copyright protection does not extend to works “generated by new forms of technology operating absent any guiding human hand, as plaintiff urges here.”' From: https://www.whitefordlaw.com/news-events/client-alert-can-wo...

The court is using common sense when it comes to the law. It is very explicit and always has been... That word "human" has some long standing sticky legal meaning (as opposed to things that were "property").

The example is a real legal case afaik, or perhaps paraphrased from one (don’t think it was a monkey - an ape? An elephant?).

I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.

I think you're thinking of this case [1], it was a monkey, it wasn't a painting but a selfie. A painting would have only made the no-copyright argument stronger.

[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

I don’t think the code you get from rails generate is yours. Certainly not by way of copyright, which protects original works of authorship and so if it’s not original, it’s not copyrightable, and yes it’s been decided in US courts that non-human-authorship doesn’t count as creative.

> If you take all the works of Shakespeare, and reduce it to tokens and vectors is it Shakespeare or is it factual information about Shakespeare?

To rephrase the question:

Is a PDF of the complete works of Shakespeare Shakespeare, or is it factual information about Shakespeare?

Reencoding human-readable information into a form that's difficult for humans to read without machine assistance is nothing new.
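To make the re-encoding point concrete, here is a minimal sketch, assuming a toy word-level vocabulary (the `build_vocab`, `encode`, and `decode` helpers are hypothetical and not how any production tokenizer works). It only illustrates that turning text into integer IDs is, by itself, a reversible change of representation; it says nothing about what a trained model retains afterwards.

```python
def build_vocab(text):
    """Assign each distinct space-separated token an integer ID (toy scheme)."""
    vocab = {}
    for token in text.split(" "):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Replace every token with its integer ID."""
    return [vocab[token] for token in text.split(" ")]


def decode(ids, vocab):
    """Invert the mapping and rebuild the original string."""
    inverse = {i: token for token, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)


line = "To be, or not to be, that is the question"
vocab = build_vocab(line)
ids = encode(line, vocab)

print(ids)                         # unreadable to a human without the vocab
print(decode(ids, vocab) == line)  # the round trip recovers the text exactly
```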

Like most things in law, the answers are going to come down to intent and outcome. If you distribute the PDF to other people with the intent that they can read the copyrighted works of an author, then you have distributed that author's content in violation of copyright. If, on the other hand, you encrypted the entire contents of that PDF, threw away the encryption key, and then published prints of the PDF as artwork of binary code, that's probably going to fall on the side of "fair use" even though the entire copyrighted work is input to and contained in your final output. You might still get into some legal hot water if you promoted your work using the author's name, but that's more of a trademark issue than a copyright issue.

> Like most things in law, the answers are going to come down to intent and outcome. If you distribute the PDF...

I wasn't talking about distribution, and neither was the person whom I was replying to. But, thanks for wasting your time on publishing the rest of your comment, I guess.

The question is going to be how much human intellectual input there was I think. I don't think it will take much - you can write the crappiest novel on earth that is complete random drivel and you still have copyright on it.

So to me, if you are doing literally any human review, edits, or control over the AI, then I think you'll retain copyright. There may be a risk that if somebody can show that they could produce exactly the same thing from a generic prompt with no interaction then you may be in trouble, but let's face it, should you have copyright at that point?

This is, however, why I favor stopping slightly short of full agentic development at this point. I want the human watching each step and an audit trail of the human interaction in doing it. Sure I might only get to 5x development speed instead of 10x or 20x but that is already such an enormous step up from where we were a year ago that I am quite OK with that for now.

I mean, sort of. The issue is that the compression is novel. So anything post tokenization could arguably be considered value add and not necessarily derivative work.

> the training itself is fair use

Sure, training by itself isn't worth anything.

Distributing and collecting payment for the usage of a trained model that may violate copyright, though: that's still an open legal question, and one worth billions as well.

The RIAA should step in and get the money that publishers deserve. Talking millions per book and extra to make sure the pirates learned their lesson. And prison for the management.

I keep thinking: if they bought ebooks, would that be fine, or is this required to be paper books? If it doesn't work with ebooks, the world is going to be a nightmare.

Meta did pirate basically all the books in Anna’s Archive, but if I remember correctly they just whispered a sorry and that was the end of it. Why are they not also asked to pay?

Then shouldn’t they be liable for at least 25 times this amount?

Nevertheless, a crime is a crime.

I'm so over this shift in America's business model.

Original Silicon Valley model, and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.

New model: Still come up with a cool idea, still get it funded and sold, but the idea involves committing crime at a staggering scale (Uber, Google, AirBnB, all AI companies, long list here), and then paying your way out of the consequences later.

Look some of these laws may have sucked, but having billionaires organize a private entity that systematically breaks them and gets off with a slap on the wrist, is not the solution. For one thing, if innovation requires breaking the law, only the rich will be able to innovate because only they can pay their way out of the law. For another, obviously no one should be able to pay their way out of following the law! This is basic "foundations of society" stuff that the vast majority of humans agree on in terms of what feels fair and just, and what doesn't.

Go to a country which has really serious corruption problems, like is really high on the corruption index, and ask the people there what they think about it. I mean I live in one and have visited many others so I can tell you, they all hate it. It not only makes them unhappy, it fills them with hopelessness about their future. They don't believe that anything can ever get better, they don't believe they can succeed by being good, they believe their own life is doomed to an unappealing fate because of when and where they were born, and they have no agency to change it. 25 years ago they all wanted to move to America, because the absence of that crushing level of corruption was what "the land of opportunity" meant. Now not so much, because America is becoming more like their country.

This timeline ends poorly for all of us, even the corrupt rich who profit from it, because in the future America will be more like a Latin American banana republic where they won't be able to leave their compounds for fear of getting Luigi'ed. We normal people get poverty, they get fear and death, everyone loses. The social contract is collapsing in front of our eyes.

I agree with you, except that you’re too positive. The United States is already a banana republic.

The federal courts are a joke - the supreme court now has at least one justice whose craven corruption is notorious — openly accepting material value (ie bribes) from various parties. The district courts are being stuffed with Trump appointees with the obvious problems that go with that.

The congress is supine. Obviously they cannot act in any meaningful capacity.

We don’t have street level corruption today. But we’ve fired half the civil service, so I doubt that will continue.

It's bad but I think it's important to recognize how much worse it can get. Otherwise why would you work to save anything? I'm "positive" because I come from the US and I now live in an actual banana republic and I see firsthand how much worse things will get in America if the trajectory doesn't change.

Imagine a future where election results are casually and publicly nullified if the people with the guns don't like the result, and no one can do anything about it. Or where you can start a business but if it succeeds and you don't have the right family name it'll be taken from you and you'll be stripped of all you own and possibly put in prison for a while. That's reality in some countries, the US is not there yet, but those are the stakes we're playing for here, and why change needs to happen.

You realize that 1500 people were just pardoned for storming federal buildings, trying to kill elected official and trying to overturn an election?

Right now, the President is sending federal troops and occupying cities and just bombed a ship in Venezuela

Please see this comment and don't respond to me in the future. Since you people are all exactly the same, and can only talk about one idea, until you become capable of other thoughts, you're not worth engaging with: https://news.ycombinator.com/item?id=45149981

I asked you the question before that you didn’t answer. By what objective measure is the median US citizen better off than any 1st world country?

You said it in one word - it’s corruption.

Not creative destruction. But pure corruption.

Creative corruption?

If we were to use an entirely neutral term for it, in the case of something like Uber, really it's the privatization of control. The regulation around taxi cabs was a matter of public policy, Uber brought in its own version of this, broke the laws where it saw fit, spent money to get new laws written, basically the decisionmaking was no longer in the hands of a public institution.

Now there is a fair criticism to be made that the public institutions governing taxi cabs were sclerotic and shitty, but if you trust Uber for one nanosecond to do what's in your best interest when it doesn't align with theirs, you're a fool, and giving a private company such a huge global footprint in what was formerly a public affair is probably going to lead to tears. They absolutely will seek rents, find them and charge them to you sooner or later, that is what's in their DNA as a private company, barring effective regulation this only ends in one way.

Where we are now is we are so deep down the rabbit hole of profit chasing that there is zero interest in maintaining or strengthening our public institutions. Why would you do that? How does it get you paid? It doesn't and everyone thinks society and culture are just big jokes that don't get you paid. So no one really cares to uphold them anymore, everyone's feasting on the corpse of the civil society while it rots.

> and generally the engine of American innovation/growth/wealth equality for 200 years: Come up with a cool technology, build it in your garage, get people to fund it and sell it because it's a better mousetrap.

So exactly when was there “wealth equality” in the US? Are you glossing over the whole segregation and redlining era of the US?

And America was built on slavery and genocide.

Honest question: what's with the penchant some people have to turn every conversation into a referendum on how horrible America is?

You realize there are countries that are even worse to their citizens right? Like I'm really asking, why do so many people online seek to eliminate all conversation that isn't a simple and un-nuanced condemnation of America?

I am able to have criticisms of America while also thinking there are good things about it and that there are also worse places, but some people seem incapable of holding those three ideas in their heads simultaneously. Especially the idea that there actually are countries worse than the US, they just can't fathom that it seems, or don't consider it a fact that should receive any attention.

It’s the BS “American exceptionalism”, like this country was founded on “hard work” and the idea of “equality”, when it was literally founded on slavery and it was enshrined in the Constitution that an entire race of people were considered 3/5ths of a person.

Right this very second, the same Republican Party that fights tooth and nail for the right to bear arms is trying not to let transgender people carry guns.

Which industrial country has a higher rate of incarceration than the US? A higher infant mortality rate? Fewer people covered by health insurance? A lower life expectancy?

There is absolutely no objective quality of life measurement that you can name where the median American citizen is better off than a country in Europe or in Canada or the UK.

Women's suffrage was also not part of the deal in 1776.

Do you guys really not see "I hate slavery and suffrage matters" as a tangent to the original point which was the problem of Big Tech not following the law?

I mean we all hate slavery, there was a funny bit about that in Bad Teacher, but not every discussion has to be about it

I mean, it's a total derail, but this isn't Metafilter. Threading means we can have totally useless side conversations and anyone who doesn't like that particular sidebar can just click the [-] and move on.

Welcome to the age of grift.

Yes, but the cat is out of the bag now. Welcome to the era of every piece of creative work coming with an EULA that you cannot train on it. It will be like clearing samples.

Many already did this years ago for game resources on iClone, Unity, and UE.

There are also a lot of usage rules that now make many games unfeasible.

We dug into the private markets seeking less Faustian terms, but found just as many legal submarines in wait... "AI" Plagiarism driven projects are just late to the party. =3

Has it been decided that training models is fair use? Has it been decided in all jurisdictions?

You can't grab pirated stuff and then hope fair use magically sanitizes it

Do they actually need to scan the book?

Or can they buy the book, and then use the pirated copy?

Wdym Rainbows End was prescient?

There's a scene early on where libraries are being destructively shredded, with the shreds scanned and reconstructed as digital versions.

It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget about the cost of 1) finding 2) purchasing 3) processing and 4) recycling that volume of books.

> Chopping the spine off books and putting the pages in an automated scanner is not scalable.

That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.

We hem and haw about metaphorical "book burning" so much we forget that books themselves are not actually precious.

The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.

>we forget that books themselves are not actually precious.

Book burnings are symbolic (unless you're in the world of Fahrenheit 451). The real power comes from the political threat, not the fact that paper with words on it is now unreadable.

The real power comes from the purging of knowledge from institutions that can keep that knowledge alive. Facts, ideas and histories can all be incinerated.

Well, the famous 1933-05-10 book burning did destroy the only copies of a lot of LGBT medical research, and destroying the last copy of various works was a stated intent of Nazi book burnings.

I remember them having a 3D page unwarping tech they built as well so they could photograph rare and antique books without hacking them apart.

No, that's not how Google Books did it. https://en.wikipedia.org/wiki/Google_Books#Scanning_of_books

I don't think Google Books scanner chopped off the spine. https://linearbookscanner.org/ is the open design they released.

Oh I didn't know that. That's wild

I guess companies will pay for the cheapest copies for liability and then use the pirated dumps. Or just pretend that someone lent the books to them.

It's not settled whether AI training is fair use.

Okay, so the blame for the offense was laundered..

Paying $3,000 for pirating a ~$30 book seems disproportionate.

I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.

It’s crazy to imagine, but there was surely a document or Slack thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not on the assumption that it was legal. Piracy can result in jail time IIRC, so honestly the employee who suggested this, or took the action, is lucky to have avoided direct legal liability.

Oh, and I’m pretty sure other companies (Meta) are in litigation over this issue, and the publishers knew that a settlement below the full legal limit would limit future revenue.

> handsomely profiting

Well actively generating revenue at least.

Profits are still hard to come by.

Operating profits certainly but if you include investments the big players are raking it in aren't they?

Investment is debt lol. Maybe you can make the argument that you're increasing the equity value but you do have to eventually prove you're able to make money right? Maybe you don't, this system is pretty messed up after all.

As long as you have more money coming in than your costs, then it's technically a profit even if that money comes from investments.

It's not the same as debt from a loan, because people are buying a percentage stake in the company. If the value of the company happens to go to zero there's nothing left to pay.

But yeah, the amount of investment a company attracts should have something to do with the perception that it'll operate at a profit at some point

What a fascinating software project someone had the opportunity to work on.

Not if 100 companies did it and they all got away.

This is to teach a lesson because you cannot prosecute all thieves.

Yale Law Journal actually writes about this, the goal is to deter crime because in most cases damages cannot be recovered or the criminal will never be caught in the first place.

If in most cases damages cannot be recovered or the criminal will never be caught in the first place, then what is the lesson being taught? Doesn't that just create a moral hazard where you "randomly" choose who to penalize?

It's about sending a message.

The message being you’ll likely get away with it?

They're setting up a pretty simple EV calc:

(Probability of getting caught) 0.01 * (Cost if caught) 1000x the gain = 10x the gain (Expected Cost) = not worth it
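That back-of-the-envelope calculation can be written out. This is only a sketch under stated assumptions: it reads the 1000 as a penalty expressed as a multiple of the gain from infringing, and the 0.01 as the chance of being caught; both numbers are the illustrative ones from the comment, not real enforcement data, and `expected_penalty` is a hypothetical helper.

```python
def expected_penalty(p_caught, penalty_if_caught):
    """Expected cost of infringing: chance of being caught times the penalty."""
    return p_caught * penalty_if_caught


gain = 1.0                 # normalise the gain from infringing to 1 unit
p_caught = 0.01            # illustrative: caught 1 time in 100
penalty = 1000.0 * gain    # illustrative: penalty is 1000x the gain

cost = expected_penalty(p_caught, penalty)
worth_it = gain > cost

print(f"expected cost: {cost:.1f}x the gain, worth it: {worth_it}")
```

The deterrent only works while the penalty multiplier stays above 1 / p_caught; with these numbers, anything under 100x the gain would leave infringement with a positive expected value.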

The EV calculation completely goes away if you add a layer of limited liability corporation.

Even if the goal is to deter crime, we still have a principle of proportionate punishment. We don't cut people's hands off for petty theft, and we don't execute people for exceeding the speed limit even though both should be pretty effective deterrents.

As long as they haven't been bullied into the corporate equivalent of suicide by the "justice" system, it's not disproportionate considering what happened to Aaron Swartz.

If anything it's too little based on precedent.

With the per-item limit for "willful infringement" being $150,000, it's a bargain.

And a low end of $750/item.
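For scale, here is the arithmetic using only figures quoted in this thread: the $750 and $150,000 per-item bounds above, the $1.5B settlement, and roughly 500,000 covered books. Treat it as a rough sketch; the book count is the approximate number mentioned elsewhere in the discussion, and a court would not simply multiply a single per-item figure across every work.

```python
WORKS = 500_000                   # approximate number of books in the settlement
SETTLEMENT = 1_500_000_000        # $1.5B
STATUTORY_MIN = 750               # low end per item, as quoted above
STATUTORY_MAX_WILLFUL = 150_000   # per-item limit for willful infringement

per_work = SETTLEMENT // WORKS
low = STATUTORY_MIN * WORKS
high = STATUTORY_MAX_WILLFUL * WORKS

print(f"settlement per work:  ${per_work:,}")
print(f"all works at minimum: ${low:,}")
print(f"all works at maximum: ${high:,}")
```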

Were you not around when people were getting sued for running Napster?

Fines should be disproportionate at this scale. So it discourages other businesses from doing the same thing.

So they’re creating monopolies? The existing players were allowed to do it, but anyone that tries to do it now will be hit with a 1.5B fine?

Realistically it will be $30 per book and $2,970 for the lawyers

That's not how class actions work. Ever.

In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k / plaintiff) for the lead plaintiffs, also subject to the court's approval.

25% of 1.5B?
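Working that out with the numbers given in this thread (a sketch only: it assumes the full 25% cap and the full $250k are awarded, ignores administration costs, and uses the roughly 500,000 books mentioned elsewhere in the discussion):

```python
SETTLEMENT = 1_500_000_000   # $1.5B
FEE_CAP_PERCENT = 25         # cap on lawyer fees, subject to court approval
LEAD_PLAINTIFFS = 250_000    # requested total for the lead plaintiffs
WORKS = 500_000              # approximate number of covered books

fee_cap = SETTLEMENT * FEE_CAP_PERCENT // 100
remaining = SETTLEMENT - fee_cap - LEAD_PLAINTIFFS
per_work_after = remaining / WORKS

print(f"fee cap:           ${fee_cap:,}")
print(f"left for the class: ${remaining:,}")
print(f"per work:           ${per_work_after:,.2f}")
```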

Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.

Thanks for the reminder that what the Internet Archive did in its case would have been legal if it was in service of an LLM.

Many things become legal when the perpetrator has money.

The golden rule:

He who has the gold makes the rules

I like the IA as much as anyone else, but surely there's a significant difference between distributing literal word for word exact copies of copyrighted material and distributing statistical indexes about copyrighted material right?

LLMs are turning out to be a real get-out-of-legal-responsibilities card, hey?

This is a good soundbite but doesn't make sense. The Internet Archive had to pay for redistributing copyrighted materials. Anthropic just paid too. (Note: redistributing != training)

I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.

In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.

So how did they profit off the pirated books?

According to the judge, they didn't. The judge said they stored those books in a general purpose library for future use just in case they decided to use them later. It appears the judge took much issue with the downloading of "pirated content." And Anthropic decided to settle rather than let it all play out more.

But then how was the settlement amount determined, if nobody read those books and there was no financial loss...

That is something which is extremely difficult to prove from either side.

It is 500,000 books in total so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?

The 500,000 number is the number of books that are part of the settlement. If they downloaded all of Libgen and the other sources, it was more like 7 million or more. But it is a lot of work to determine which books can legitimately be part of the lawsuit. For example, if any of the books in the download weren't copyrighted (think self-published), or weren't protected under US copyright law (maybe a book only published in Venezuela), or it isn't clear who owns the copyright, then that copyright owner cannot be part of the class. So it seems like the 500,000 number is basically the smaller number of books for which the plaintiffs' lawyers felt they could most easily prove standing.

> Buying used copies of books, scanning them, and training on it is fine.

Buying used copies of books, scanning them, and printing them and selling them: not fair use

Buying used copies of books, scanning them, and making merchandise and selling it: not fair use

The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research. Training AI models for purposes other than purely academic fits into none of these.

Buying used copies of books, scanning them, training an employee with the scans: fair use.

Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.

> Buying used copies of books, scanning them, training an employee with the scans: fair use.

Does this still hold true if multiple employees are "trained" from scanned copies at the same time?

Simultaneously I guess that would violate copyright, which is an interesting point. Maybe there's a case to be made there with model training.

Regardless, the issue could be resolved by buying as many copies as you have concurrent model training instances. It isn't really an issue with training on copyrighted work, just a matter of how you do so.

Computers aren't people. And analogies aren't laws.

Yes, but the law doesn’t exist, so until it catches up, analogies are all the legal system has to work with.

The purpose and character of AI models is transformative, and the effect of the model on the copyrighted works used in the model is largely negligible. That's what makes the use of copyrighted works in creating them fair use.

Are "fantasy name generators" of the sort you find all over the place online fair use if the weighting of their generators is based on statistical information about names in fantasy novels? I would think most people would agree they're fair use, or if not in so many words, I think those people would find it pretty unfair for WotC to go around suing sites for running D&D character name generators.
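For anyone who hasn't seen how such a generator works, here is a minimal sketch, assuming a character-level Markov chain (one common approach; actual sites may do something else entirely). The `train` and `generate` helpers and the sample names are made up for illustration. The "model" is nothing more than counts of which letter tends to follow which pair of letters.

```python
import random
from collections import defaultdict


def train(names, order=2):
    """Count which character follows each `order`-length context."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        # "^" pads the start, "$" marks the end of a name
        padded = "^" * order + name.lower() + "$"
        for i in range(len(padded) - order):
            context = padded[i:i + order]
            counts[context][padded[i + order]] += 1
    return counts


def generate(counts, order=2, rng=None, max_len=12):
    """Sample a new name one character at a time, weighted by the counts."""
    rng = rng or random.Random()
    context = "^" * order
    out = []
    while len(out) < max_len:
        followers = counts[context]
        chars = list(followers)
        weights = [followers[c] for c in chars]
        nxt = rng.choices(chars, weights=weights, k=1)[0]
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out).capitalize()


# Hypothetical training names, invented for this example
sample_names = ["Aldren", "Belira", "Corvan", "Delmira", "Eldran", "Farina"]
model = train(sample_names)
print([generate(model, rng=random.Random(seed)) for seed in range(5)])
```

With a corpus this small the output will often reproduce a training name verbatim, which is the toy-sized version of the regurgitation question discussed elsewhere in this thread.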

Or let's talk about another form of buying copyrighted / protected content and selling the results of transforming it: emulators. The Connectix Virtual Game Station was the impetus for one of the most important lawsuits about emulation, and the ruling held that even though writing an emulator inherently involves copying copyrighted code, the result is sufficiently transformative and falls under fair use.

It fits the most basic fair use: reading them. Current "training" can be considered a gross form of reading.