I think it would be interesting to deflate out to a huge dataset and see where this happens.

Certainly it will occur once the generated data exceeds the original, e.g. after 1-10T tokens.

I think you could also get there faster by moving down the tree in a depth-first manner.
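
A minimal sketch of what I mean by depth-first generation over a topic tree, assuming a hypothetical `generate(prompt)` wrapper around whatever model is producing the data (the function names, prompts, and sample counts are all made up for illustration): exhaust one branch completely before touching its siblings, so you reach the leaves, and whatever degrades at the leaves, much sooner.

```python
# Sketch only: `generate` is a placeholder for whatever model call you actually use.
def generate(prompt: str) -> str:
    # Swap in a real sampling call here.
    return f"<model output for: {prompt}>"

def expand_topic(topic: str, max_children: int = 5) -> list[str]:
    # Ask the model for subtopics of the current node, one per line.
    response = generate(f"List up to {max_children} subtopics of: {topic}")
    return [line.strip() for line in response.splitlines() if line.strip()][:max_children]

def generate_samples(topic: str, n: int = 3) -> list[str]:
    # Emit a few synthetic samples for the current node before descending.
    return [generate(f"Write a training passage about: {topic}") for _ in range(n)]

def depth_first_generate(topic: str, depth: int, samples: list[str]) -> None:
    # Depth-first: finish one branch before starting the next, so the deepest
    # (and most derivative) data shows up early instead of last.
    samples.extend(generate_samples(topic))
    if depth == 0:
        return
    for child in expand_topic(topic):
        depth_first_generate(child, depth - 1, samples)

dataset: list[str] = []
depth_first_generate("world history", depth=3, samples=dataset)
print(len(dataset), "samples generated")
```

The point is just that the drift you'd otherwise only notice after generating the full breadth of the tree shows up after a single deep branch.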

Typically I use this for knowledge transfer, style transfer, catastrophic forgetting mitigation, etc., so I don’t go very far. I usually review the data samples manually before using them.

Huh. I wonder what good output would look like at the extremes. Hallucinations that just happen to be true, or something more interesting?