IMHO, just for the sake of discussion, it seems to fall short of a bombshell. Perhaps that's only because I'm confused by the math and got some things wrong.

TL;DR: These documents were HUGE as a percentage of the training data, even for the largest model (~192 MB per document)? Dirty data was ~4% of the training data even for the largest model? And more than 100% of the training data for the smallest?

From the abstract: "on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data."

EDIT: Going through the paper more, it's pretty clear there are details that clarify this. The "more than 20x more data" sentence is probably what I'm misinterpreting (e.g., direct from the paper: "250 poison samples represent only 0.00016% of training tokens for the 13B model and 0.0035% for 600M").

Calculations:

- The largest model was trained on 260B tokens.

- 250 documents were sufficient to poison every size model, including the largest.

- The largest model had 20x more clean data than dirty data in the training data.

- 20x + x = 260B tokens, where x = full size of dirty data, in tokens

- 21x = 260B tokens

- size of dirty data = 12B tokens

- size of dirty data = 250 documents

- tokens / document for dirty data = 48M tokens/dirty document

- token ~= 4 bytes

- dirty document = 192 MB?
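For what it's worth, here's the same back-of-envelope math in a few lines of Python (the 20:1 clean-to-dirty premise is, per the EDIT above, almost certainly the misreading; the constants are just the numbers from the list):

```python
# Reproduce the back-of-envelope arithmetic above, under the (mistaken)
# reading that the largest model's clean data is only 20x its dirty data.
total_tokens = 260e9      # largest model's training set, from the abstract
poison_docs = 250         # poisoned documents, from the abstract
bytes_per_token = 4       # rough assumption used above

# clean = 20 * dirty and clean + dirty = total  =>  dirty = total / 21
dirty_tokens = total_tokens / 21
tokens_per_doc = dirty_tokens / poison_docs
mb_per_doc = tokens_per_doc * bytes_per_token / 1e6

print(f"dirty tokens:   {dirty_tokens / 1e9:.1f}B")    # ~12.4B (rounded to 12B above)
print(f"tokens per doc: {tokens_per_doc / 1e6:.1f}M")  # ~49.5M (~48M with the rounding above)
print(f"MB per doc:     {mb_per_doc:.0f}")             # ~198 MB (~192 MB with the rounding above)
```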

My reading is that the largest model has 20x more clean data than the smallest model, not that there is only 20x more clean data than dirty data, which would imply the ~4% you have here. I agree it could be worded more clearly.

> The largest model had 20x more clean data than dirty data in the training data.

Yeah, I think this is the main misinterpretation. I read it as the largest model being trained on 20x more clean data than the small model. I don't think the ratio of clean to dirty data was 20x. Going by the quoted percentages, the ratio of clean to dirty data for the large model was more like 625,000:1 and for the smaller model roughly 28,600:1 at 250 poisoned documents (the reciprocal of the poisoned-token fraction for each).
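A quick sketch of those ratios, computed straight from the percentages quoted in the EDIT above (note they're percentages, so divide by 100 before taking the reciprocal):

```python
# Clean-to-dirty ratios implied by the poisoned-token percentages quoted
# from the paper, rather than by the "20x" reading.
poison_pct = {"13B model": 0.00016, "600M model": 0.0035}  # % of training tokens

for model, pct in poison_pct.items():
    frac = pct / 100                    # convert percent to a fraction
    clean_to_dirty = (1 - frac) / frac  # ~= 1 / frac for tiny fractions
    print(f"{model}: clean:dirty ~ {clean_to_dirty:,.0f}:1")

# 13B model: clean:dirty ~ 625,000:1
# 600M model: clean:dirty ~ 28,570:1
```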