Hacker News

[flagged]

The websites, music, movies, books, photos, art that they stole didn't appear out of thin air. The amount of time and effort people have collectively poured into creating these works throughout history far, far surpasses Anthropic's own effort of converting them into model weights.

bloppe 19 hours ago

The equivocation is crawling websites <-> crawling LLM responses.

Both Anthropic and Alibaba are trying to build bleeding-edge LLMs. That part is the same. The way they source their data is slightly different, but they would both argue it constitutes fair use under copyright law.

walrus01 19 hours ago

"Your extremely efficient multi petabyte internet content suction machine is ripping off my extremely efficient multi petabyte internet content suction machine"

Sucking down petabytes of people's copyrighted content, which they never granted you a specific license to use, seems to be an unavoidable and default part of the process of building any huge LLM.

nonethewiser 19 hours ago

So why was there crawling in 1998 but no LLMs?

hasteg 18 hours ago

Because the transformer, which all of these models are built on and which none of these companies (bar Google) invented themselves, hadn't been invented yet? The amount of effort it took humanity to generate all the data the models needed to get to where they are now is not even comparable to the effort it took to build the model code. Yeah, it's complicated, but if they hadn't ripped off all of humanity's combined output, it wouldn't even have mattered that the transformer got invented.

Chu4eeno 17 hours ago

Google didn't really invent much. They just had access to an insane amount of data and compute, which let them try training a model with just the attention mechanism from an earlier machine translation paper by some poor academics, ripping out (most of) the rest. It turned out to work very well (though it is insanely training-data and compute intensive).

12_throw_away 14 hours ago

I am unable to comprehend the state of mind that would lead one to ask this question.

vitally3643 17 hours ago

We didn't have GPUs with hundreds of gigabytes of VRAM and tensor processing cores.

walrus01 17 hours ago

Or a feasible/economical way to attempt to store the sum total of human written output, multiple petabytes of data (outside of the resources of the NSA, maybe), when a server with 6 x 36GB 10K RPM SCSI HDDs in RAID-5 was high end, and its network uplink would be at most two ports of 1 gigabit ethernet.

epsteingpt 19 hours ago

It's not really equivocation in this instance. This feels like a 'bad faith' comment. We can do better.

LLMs literally wouldn't work without the sum total of knowledge (in the form of books and other copyrighted content) being used as 'training data'.

The 'bleeding edge' LLMs required many things, but:

1. Tech innovation ('attention')
2. Lots of compute
3. Data
4. Pre + post training

#4 doesn't happen without #3.

It's pretty obvious at this point that the major providers have stolen vast amounts of #3, and they have paid almost none of the creators.

We can argue about the impact (I'd lean net good) vs. the cost. But arguing there isn't a cost is a bit silly.

nonethewiser 19 hours ago

All of this supports the point that models aren't essentially just web crawling.

margalabargala 19 hours ago

Sure, but Alibaba is still building an LLM. The scraping of responses and the scraping of websites occupy the same place in each company's stack. It's very comparable.

bel8 17 hours ago

The tech is Google's invention, popularized by OpenAI, so Anthropic should still stfu in that case.