https://www.reddit.com/r/MachineLearning/comments/1jsft3c/r_...
I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?
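If it helps, here's the toy picture I have of "unrolling + BPTT" (my own sketch, nothing from the paper): the same cell is applied T times and a single loss is backpropagated through all T applications, so every intermediate state has to be kept around for the backward pass.

    import torch
    import torch.nn as nn

    cell = nn.GRUCell(input_size=8, hidden_size=16)
    opt = torch.optim.SGD(cell.parameters(), lr=1e-2)

    xs = torch.randn(10, 4, 8)   # T=10 timesteps, batch of 4
    h = torch.zeros(4, 16)
    for x_t in xs:               # unroll the recurrence
        h = cell(x_t, h)
    loss = h.pow(2).mean()       # toy loss on the final state
    opt.zero_grad()
    loss.backward()              # BPTT: gradients flow back through all 10 steps
    opt.step()

The contrast would be training each denoising step on its own, with no gradients flowing between steps.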
I think we need to start thinking about one-shot training. I.e. instead of feeding context into the LLM, you should be able to tell it a fact and have it encode that fact into updated weights.
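Concretely, the naive version of what I mean is just a single fine-tuning step on the stated fact. Rough sketch below; the model choice, learning rate, and fact are all placeholders, not anything from the paper:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)

    fact = "The capital of Freedonia is Fredville."  # made-up fact
    ids = tok(fact, return_tensors="pt").input_ids
    out = model(ids, labels=ids)   # standard LM loss on the fact itself
    opt.zero_grad()
    out.loss.backward()
    opt.step()                     # the fact is now (weakly) baked into the weights

Whether one gradient step actually makes the fact reliably retrievable later is exactly the open question.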
In all their experiments, backprop is used for most of their parameters though...
There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.
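To make the memory point concrete, here's a toy contrast (my sketch, not the paper's code): layer-local training detaches between layers, so each backward pass only touches one layer's graph, while end-to-end backprop has to retain activations for the whole network.

    import torch
    import torch.nn as nn

    layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])
    opt = torch.optim.SGD(layers.parameters(), lr=1e-3)
    x, target = torch.randn(64, 512), torch.randn(64, 512)

    # Layer-local: detach between layers, so each backward() sees one layer's graph.
    h = x
    for layer in layers:
        out = torch.relu(layer(h))
        loss = (out - target).pow(2).mean()   # some per-layer target/loss
        opt.zero_grad(); loss.backward(); opt.step()
        h = out.detach()                      # stop gradients at the layer boundary

    # End-to-end: one backward() through all 8 layers, so all activations are retained.
    h = x
    for layer in layers:
        h = torch.relu(layer(h))
    loss = (h - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()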
It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Because backprop is just an implementation detail here, you could theoretically swap in other layer types or solvers.