I think the reason this hasn't been done is that you have no way to decide, at train time, how many recursions are necessary.
And if you pick a random number, or try many different levels of recursion, you 'blur' the output. I.e., a layer can't know whether it should be emitting information needed for the final result, or the best possible input to another round of recursion.
Yes, I think training this model would be hard. Perhaps something akin to how MoEs are trained, where you impose an auxiliary load-balancing loss to encourage equitable routing, but applied to recursion depth instead of experts.
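For concreteness, a minimal sketch of that idea, borrowing the Switch Transformer load-balancing loss and treating each recursion depth as an "expert". The `depth_logits` router here is my own assumed setup, not something from an existing paper:

```python
import torch
import torch.nn.functional as F

def depth_balance_loss(depth_logits: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss over recursion depths instead of experts.

    depth_logits: (num_tokens, num_depths) raw router scores. The loss is
    minimized (value 1.0) when tokens are spread uniformly across depths.
    """
    num_depths = depth_logits.size(-1)
    probs = depth_logits.softmax(dim=-1)
    # f_i: fraction of tokens whose hard argmax choice is depth i
    hard_frac = F.one_hot(probs.argmax(dim=-1), num_depths).float().mean(dim=0)
    # P_i: mean router probability assigned to depth i
    soft_frac = probs.mean(dim=0)
    return num_depths * (hard_frac * soft_frac).sum()
```

Added to the task loss with a small coefficient, this would stop the router from collapsing onto a single depth, the same way MoE aux losses stop routing collapse.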
Look at the human brain for useful analogies?
The default mode network does recursive/looping processing in the absence of external stimuli and world interaction. Multiple separate modules outside the network are responsible for stopping and regulating this activity.
You could just learn an estimate of the right number of recursions, while also passing 'backtracking'/state information to the next nested level. Much as state space models encode extractable information via a basis-function representation, you could encode extractable recursion-state information into the embedding. See also transformers that can learn to recognize n-deep balanced parentheses (Dyck-n languages).
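A sketch of the learned-halting version, roughly in the spirit of Graves' Adaptive Computation Time. The `RecursiveBlock` class, its halting head, and the ponder term are my assumptions, not an established recipe; the recursion "state" here is just the evolving hidden vector, so any backtracking information has to ride along in the embedding, as suggested above:

```python
import torch
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """One shared block applied repeatedly, with a learned per-token halt."""
    def __init__(self, dim: int, max_steps: int = 8, eps: float = 0.01):
        super().__init__()
        self.step = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.halt = nn.Linear(dim, 1)   # scalar halting probability per token
        self.max_steps, self.eps = max_steps, eps

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        halted = x.new_zeros(b, t)      # accumulated halting probability
        out = torch.zeros_like(x)       # halting-weighted mixture of states
        ponder = x.new_zeros(b, t)      # counts iterations each token stays active
        for n in range(self.max_steps):
            x = self.step(x)            # state carries recursion info forward
            p = torch.sigmoid(self.halt(x)).squeeze(-1)
            still = (halted < 1 - self.eps).float()
            # final step, or crossing the threshold, spends the remainder
            last = (n == self.max_steps - 1) | (halted + p >= 1 - self.eps)
            w = torch.where(last, 1 - halted, p) * still
            out = out + w.unsqueeze(-1) * x
            ponder = ponder + still
            halted = halted + w
        return out, ponder.mean()
```

Training would add the returned ponder cost to the loss with a small weight, so the model learns to recurse only as deep as a token actually needs.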
I have been thinking about this topic for some time. It might be done using the energy of the token: if the energy is still above an energy limit, process the token again and raise the limit. The energy could be computed using log-sum-exp, as in https://openreview.net/pdf?id=Hkxzx0NtDB
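A hedged sketch of how that loop might look, using the linked paper's definition E(x) = -logsumexp(logits). The names `block`, `head`, and the `limit`/`bump` schedule are hypothetical, not from the paper:

```python
import torch

def refine_until_low_energy(hidden, block, head, limit=-2.0, bump=0.5, max_iters=8):
    """Re-process tokens whose energy stays above a rising threshold.

    block: any shared layer mapping hidden -> hidden (assumption).
    head:  maps hidden states to logits over the vocabulary (assumption).
    """
    for _ in range(max_iters):
        # per-token energy, E = -logsumexp over logits, as in the linked paper
        energy = -torch.logsumexp(head(hidden), dim=-1)
        hot = energy > limit                      # tokens still high-energy
        if not hot.any():
            break
        # re-process only the high-energy tokens; the rest pass through
        hidden = torch.where(hot.unsqueeze(-1), block(hidden), hidden)
        limit += bump                             # raise the limit each round
    return hidden
```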
This is actually close to how EfficientNet trains: it uses stochastic depth, randomly dropping blocks of the network during training, and it does just fine. The game is that each layer needs to get as close as it can to a good output, improving on the quality of the previous activation.
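For reference, the stochastic-depth trick in isolation (Huang et al., 2016), which is the regularizer EfficientNet uses; `inner` stands in for any residual block:

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training.

    Because any block may be dropped, each block is pushed to make a
    useful incremental refinement of whatever activation it receives.
    """
    def __init__(self, inner: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.inner = inner
        self.p = survival_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            if torch.rand(()).item() >= self.p:
                return x                     # block dropped: identity only
            return x + self.inner(x)
        return x + self.p * self.inner(x)    # eval: expected residual
```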