I think it would be interesting to inflate out to a huge dataset and see where this happens.
Certainly it will occur as the generated data exceeds the original, e.g. after 1-10T tokens.
I think you could also do this faster by moving down the tree in a depth-first manner.
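Roughly what I mean by depth-first, as a minimal sketch: expand one branch of the generation tree all the way down before backtracking, rather than generating every child at each level first. Here `generate_children` is a hypothetical stand-in for whatever model call produces new samples from an existing one, not a real API.

```python
def expand_depth_first(root_sample, generate_children, max_depth):
    """Walk a synthetic-data generation tree depth-first.

    generate_children(sample) -> list of new samples derived from `sample`
    (hypothetical; substitute your actual generation call).
    """
    samples = []
    stack = [(root_sample, 0)]
    while stack:
        sample, depth = stack.pop()  # LIFO pop gives depth-first order
        samples.append(sample)
        if depth < max_depth:
            for child in generate_children(sample):
                stack.append((child, depth + 1))
    return samples
```

The point is just that you reach the deep, heavily-regenerated samples early, so you can spot degradation without having to materialize the whole tree.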
Typically I use this for knowledge transfer, style transfer, catastrophic forgetting mitigation, etc., and so I don't go very far. I usually manually review the data samples before using them.
Huh. I wonder what good output would look like at extremes. Hallucinations that just happen to be true or something more interesting?