> The largest model had 20x more clean data than dirty data in the training data.

Yeah, I think this is the main misinterpretation. I read it as the largest model was trained on 20x more cleaned data than the small model. I don't think the ratio of clean to dirty data was 20x. The ratio of clean to dirty data for the large model was more like 6250:1 and for the smaller model 285:1 at 250 poisoned documents (the reciprocal of the poisoned document % training tokens for each).