https://www.reddit.com/r/MachineLearning/comments/1jsft3c/r_...
I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?
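If it helps, here's the toy picture I have of "unrolling + BPTT" (my own sketch, nothing from the paper): the same cell is applied T times and a single loss is backpropagated through all T applications, so every intermediate state has to be kept around for the backward pass.

    import torch
    import torch.nn as nn

    cell = nn.GRUCell(input_size=8, hidden_size=16)
    opt = torch.optim.SGD(cell.parameters(), lr=1e-2)

    xs = torch.randn(10, 4, 8)   # T=10 timesteps, batch of 4
    h = torch.zeros(4, 16)
    for x_t in xs:               # unroll the recurrence
        h = cell(x_t, h)
    loss = h.pow(2).mean()       # toy loss on the final state
    opt.zero_grad()
    loss.backward()              # BPTT: gradients flow back through all 10 steps
    opt.step()

The contrast would be training each denoising step on its own, with no gradients flowing between steps.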
I think we need to start thinking about one-shot training. I.e. instead of feeding context into the LLM, you should be able to tell it a fact and have it encode that fact into updated weights.
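Concretely, the naive version of what I mean is just a single fine-tuning step on the stated fact. Rough sketch below; the model choice, learning rate, and fact are all placeholders, not anything from the paper:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)

    fact = "The capital of Freedonia is Fredville."  # made-up fact
    ids = tok(fact, return_tensors="pt").input_ids
    out = model(ids, labels=ids)   # standard LM loss on the fact itself
    opt.zero_grad()
    out.loss.backward()
    opt.step()                     # the fact is now (weakly) baked into the weights

Whether one gradient step actually makes the fact reliably retrievable later is exactly the open question.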
In all their experiments, backprop is used for most of their parameters though...
There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.
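To make the memory point concrete, here's a toy contrast (my sketch, not the paper's code): layer-local training detaches between layers, so each backward pass only touches one layer's graph, while end-to-end backprop has to retain activations for the whole network.

    import torch
    import torch.nn as nn

    layers = nn.ModuleList([nn.Linear(512, 512) for _ in range(8)])
    opt = torch.optim.SGD(layers.parameters(), lr=1e-3)
    x, target = torch.randn(64, 512), torch.randn(64, 512)

    # Layer-local: detach between layers, so each backward() sees one layer's graph.
    h = x
    for layer in layers:
        out = torch.relu(layer(h))
        loss = (out - target).pow(2).mean()   # some per-layer target/loss
        opt.zero_grad(); loss.backward(); opt.step()
        h = out.detach()                      # stop gradients at the layer boundary

    # End-to-end: one backward() through all 8 layers, so all activations are retained.
    h = x
    for layer in layers:
        h = torch.relu(layer(h))
    loss = (h - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()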
It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Because backprop is just an implementation detail here, you could theoretically swap in other layer types or solvers.