Models already have some limited means of refinement available to them: augment a model with any form of external memory, and it can learn by writing to that memory and then reading relevant parts of the accumulated knowledge back in the future. Of course, this is far more rigid than what biological brains can do, but it isn’t nothing.
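
Concretely, the loop I have in mind is something like the sketch below. It's illustrative only: `ExternalMemory`, `answer_with_memory`, and `model.generate` are made-up names, not any real framework's API.

```python
from typing import List

class ExternalMemory:
    """Trivially simple append-only store with naive keyword retrieval (illustration only)."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def write(self, note: str) -> None:
        self.entries.append(note)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        # Naive relevance: count words shared with the query.
        # A real system would use embeddings, but the principle is the same.
        query_words = set(query.lower().split())
        def overlap(entry: str) -> int:
            return len(set(entry.lower().split()) & query_words)
        return sorted(self.entries, key=overlap, reverse=True)[:k]


def answer_with_memory(model, task: str, memory: ExternalMemory) -> str:
    """One step of the loop: read relevant notes, answer, write a new note."""
    context = "\n".join(memory.retrieve(task))
    prompt = f"Relevant notes:\n{context}\n\nTask: {task}"
    answer = model.generate(prompt)  # `model.generate` is a stand-in, not a real API call
    memory.write(f"Task: {task}\nWhat seemed to work: {answer}")
    return answer
```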

Does “distributional drift and mode collapse” still happen if the outputs are filtered with respect to some external ground truth - e.g. human preferences, or even (in certain restricted domains such as coding) automated evaluations?
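
To make the question concrete, here is roughly what I mean by “filtered” in the coding case: a self-generated solution only survives if an external check (e.g. unit tests) verifies it. Sketch only; `generate_candidates` and `passes_checks` are placeholders for whatever the model and verifier actually are.

```python
from typing import Callable, List, Tuple

def filter_by_ground_truth(
    tasks: List[str],
    generate_candidates: Callable[[str], List[str]],  # model samples per task (placeholder)
    passes_checks: Callable[[str, str], bool],        # external verifier, e.g. unit tests (placeholder)
) -> List[Tuple[str, str]]:
    """Keep only (task, solution) pairs that the external check actually verified."""
    kept: List[Tuple[str, str]] = []
    for task in tasks:
        for candidate in generate_candidates(task):
            if passes_checks(task, candidate):
                kept.append((task, candidate))
                break  # one verified solution per task is enough for this sketch
    return kept
```

The filter injects information the model didn’t generate itself, which is why I’m asking whether that changes the picture.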

I wasn’t talking about human reinforcement.

The discussion has been about CoT in LLMs, so I’ve been referring to the model in isolation from the start.

Here’s how I currently understand the structure of the thread (apologies if I’ve misread anything):

“Is CoT actually thinking?” (my earlier comment)

→ “Yes, it is thinking.”

  → “It might be thinking.”

    → “Under that analogy, self-training on its own CoT should work — but empirically it doesn’t.”

      → “Maybe it would work if you add external memory with human or automated filtering?”

Regarding external memory:

Without an external supervisor, whatever gets written into that memory is still the model’s own self-generated output, which brings us back to the original problem.