My gripe with an approach like this is the lack of any grounding for the generated topics. Hallucination accumulates like error here, since every generation is conditioned on a previous one (the recursive "hierarchical topic exploration" in TFA).
I suspect most of the "leaves" are unusable.
The question is: is it like JPEG compression, where the errors do not keep accumulating but the image converges to a stable, self-reproducing compressed image, or does the dataset converge to a single point that is meaningless?
The transformation function in JPEG (the DCT) is well-defined math. While the format is lossy, most of the information is reproducible.
An LLM is layers and layers of non-linear transformations, so it's hard to say exactly how information accumulates. You can inspect activations for individual tokens, but it's not clear how to characterize what the function as a whole is doing, so the error behavior is poorly understood.
JPEG is similar, actually. The DCT itself is invertible, but the DCT coefficients are then quantized, which is where much of the compression happens (DCT -> quantization -> IDCT), so the end-to-end process is not truly invertible. Maybe that's an analogy to the non-linearities between the linear layers in deep learning.
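To make the fixed-point question concrete, here's a toy numpy sketch of that DCT -> quantize -> IDCT round trip (not real JPEG: no 8x8 blocks, no per-frequency quantization table, no integer pixel rounding, and the step size Q is made up). The first pass throws information away; every pass after that returns essentially the same image, because the quantized coefficients already sit on the grid:

    # Toy demo: repeated JPEG-style round-trips settle to a fixed point.
    import numpy as np

    N = 8
    # Orthonormal DCT-II matrix: C @ x is the DCT of x, C.T inverts it.
    k = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
    C[0, :] /= np.sqrt(2.0)

    Q = 10.0  # quantization step (assumption; real JPEG uses a per-frequency table)

    def roundtrip(x):
        coeffs = C @ x                        # DCT
        quantized = np.round(coeffs / Q) * Q  # lossy quantization
        return C.T @ quantized                # inverse DCT

    rng = np.random.default_rng(0)
    prev = rng.uniform(0, 255, N)  # stand-in for one row of pixels

    for i in range(5):
        cur = roundtrip(prev)
        print(f"pass {i + 1}: max change vs previous = {np.abs(cur - prev).max():.6f}")
        prev = cur
    # Pass 1 shows a large change; passes 2+ change only at float precision.

Whether recursive LLM generation behaves more like this (stabilizing) or like a contraction toward a single meaningless point is exactly the open question.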
I think it would be interesting to expand this out to a huge dataset and see where this happens.
It will certainly occur once the generated data exceeds the original, e.g. after 1-10T tokens.
I think you could also do this faster by moving down the tree in a depth-first manner (rough sketch below).
Typically I use this for knowledge transfer, style transfer, catastrophic-forgetting mitigation, etc., so I don't go very far down the tree. I usually manually review the data samples before using them.
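For anyone curious, the depth-first variant could look something like the sketch below, with a hypothetical complete() LLM call and made-up prompts and depth limit standing in for whatever the tool actually does. The point is just that you reach full root-to-leaf topic paths early and can manually review them before generating samples, instead of building out the whole breadth-first frontier first:

    # Hypothetical helper: replace with your actual LLM client call.
    def complete(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def subtopics(topic: str, n: int = 3) -> list[str]:
        # Ask the model for n narrower subtopics, one per line (prompt is a guess).
        text = complete(f"List {n} narrower subtopics of '{topic}', one per line.")
        return [line.strip("-* ").strip() for line in text.splitlines() if line.strip()][:n]

    def explore_depth_first(topic: str, depth: int = 0, max_depth: int = 3, path=()):
        # Recursion gives depth-first order for free: each branch is expanded
        # down to a leaf before its siblings are touched.
        path = path + (topic,)
        if depth == max_depth:
            yield path  # leaf: review this path before generating samples from it
            return
        for child in subtopics(topic):
            yield from explore_depth_first(child, depth + 1, max_depth, path)

    # Usage:
    # for leaf in explore_depth_first("linear algebra"):
    #     print(" > ".join(leaf))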
Huh. I wonder what good output would look like at the extremes. Hallucinations that just happen to be true, or something more interesting?
No different from ordinary inference... Just saying.