Even with 1 TB of weights (the probable size of the largest state-of-the-art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6 TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.
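
A quick back-of-envelope sketch in Python. The ~6 TB figure is from above; the cleaned-C4 size is my rough assumption (often reported as around 750 GB), so treat the numbers as orders of magnitude only:

```python
# Rough sizes, in TB. The ~6 TB uncleaned figure is from the comment above;
# the cleaned-C4 figure is an approximate assumption (~750 GB, uncompressed).
RAW_C4_SOURCE_TB = 6.0    # uncleaned 2019 Common Crawl text that C4 was filtered from
CLEANED_C4_TB = 0.75      # cleaned English C4, still uncompressed (assumption)
WEIGHTS_TB = 1.0          # hypothetical 1 TB of model weights

print(f"Cleaned C4 fits in {WEIGHTS_TB} TB uncompressed: {CLEANED_C4_TB <= WEIGHTS_TB}")
print(f"1 TB is {WEIGHTS_TB / RAW_C4_SOURCE_TB:.0%} of even the raw, uncleaned scrape")
```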

I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.

This would imply that the English internet is not much bigger than 20x the English Wikipedia.
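
Back-of-envelope arithmetic behind that multiple, assuming roughly 50 GB of uncompressed English Wikipedia article text (an order-of-magnitude guess on my part, not an exact figure):

```python
# If 1 TB really covered "a significant portion" of the English internet,
# how many English Wikipedias would that be? The Wikipedia size is an assumption.
ENGLISH_WIKIPEDIA_TEXT_GB = 50    # assumed uncompressed English article text
CLAIMED_PORTION_TB = 1.0          # the 1 TB figure from the comments above

multiple = CLAIMED_PORTION_TB * 1000 / ENGLISH_WIKIPEDIA_TEXT_GB
print(f"1 TB is roughly {multiple:.0f}x English Wikipedia")   # ~20x
```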

That seems implausible.

> That seems implausible.

Why, exactly?

Dismissing facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.

A lot of the internet is duplicate data, low-quality content, SEO spam, etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

I would be extremely surprised if it was that small.

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.