[flagged]

The websites, music, movies, books, photos, and art that they stole didn't appear out of thin air. The amount of time and effort people have collectively poured into creating these works throughout history far, far surpasses Anthropic's own effort in converting them into model weights.

The equivocation is crawling websites <-> crawling LLM responses.

Both Anthropic and Alibaba are trying to build bleeding-edge LLMs. That part is the same. The way they source their data is slightly different, but they would both argue it constitutes fair use under copyright law.

"Your extremely efficient multi petabyte internet content suction machine is ripping off my extremely efficient multi petabyte internet content suction machine"

Sucking down petabytes of people's copyrighted content, which they never granted you a specific license to use, seems to be an unavoidable and default part of the process of building any huge LLM.

So why was there crawling in 1998 but no LLMs?

Because the transformer, which all of these models are built on and which none of these companies (bar Google) invented themselves, hadn't been invented yet? The amount of effort it took humanity to generate all the data that was required for the models to get to the point they're at now is absolutely not even comparable to how much effort it took to build the model code. Yeah, it's complicated, but if they hadn't ripped off all of humanity's combined output, it wouldn't even have mattered that the transformer got invented.

Google didn't really invent much. They just had access to an insane amount of data and compute, so they could try training a model with just the attention mechanism from an earlier paper on machine translation by some poor academics, ripping out (most of) the rest. It turned out to work very well (though it is insanely training-data and compute intensive).

I am unable to comprehend the state of mind that would lead one to ask this question.

We didn't have GPUs with hundreds of gigabytes of VRAM and tensor processing cores.

Or a feasible/economical way to attempt to store the sum total of human written output, multiple petabytes of data (outside of the resources of the NSA, maybe), when a server with 6 x 36GB 10K RPM SCSI HDDs in RAID-5 was high end, and its network uplink would be at most two ports of 1 gigabit ethernet.

[deleted]

[flagged]

It's not really equivocation in this instance. This feels like a 'bad faith' comment. We can do better.

LLMs literally wouldn't work without the sum total of knowledge (in the form of books and other copyrighted content) being used as 'training data'.

The 'bleeding edge' LLMs required many things, including:
1. Tech innovation ('attention')
2. Lots of compute
3. Data
4. Pre- + post-training

#4 doesn't happen without #3.

It's pretty obvious at this point that the major providers have stolen vast amounts of #3 - they have paid nearly none of the creators.

We can argue about the impact (I'd lean net good) vs. the cost. But arguing there isn't a cost is a bit silly.

All of this supports the fact that models aren't essentially just web crawling.

Sure, but Alibaba is still building an LLM. The scraping of responses and the scraping of websites occupy the same place in each company's stack. It's very comparable.

The tech is Google's invention, popularized by OpenAI, so Anthropic should still stfu in that case.