Hacker News

I think Cosmo's refutations were mostly not very useful and based on misunderstandings of what I was trying to say. This is fine and we discussed it prior to their article being published.

The point I was trying to make with "RL is only necessary once" is that you can run a single self-play loop that keeps getting better and better, and this will get you to something close to the frontier. Once you're at the frontier, the frontier doesn't move very much, so you have quite a while (a decade?) where it's totally fine to distill from the RL games.

On correction histories -- imo I correctly described what they do. Cosmo was annoyed by the word "adapt" but what I described was the adaptation.

On SPSA -- you don't have a gradient! You don't do backprop! This is what I was trying to get at.
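
To make the "no gradient, no backprop" point concrete, here's a minimal sketch of a generic SPSA loop. The toy objective, the step sizes, and the names `spsa_minimize` and `toy_loss` are my own assumptions for illustration, not any engine's actual tuner. The only thing the optimizer ever asks of the objective is two evaluations per step, at the parameters perturbed in opposite random directions.

```python
import random


def spsa_minimize(loss, theta, steps=2000, a=0.1, c=0.1, seed=0):
    """Minimize `loss` using function evaluations only (no derivatives)."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(1, steps + 1):
        # Decaying gains; 0.602 and 0.101 are the commonly cited exponents.
        a_k = a / k ** 0.602
        c_k = c / k ** 0.101
        # Perturb every parameter at once with a random +/-1 direction.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Two black-box evaluations are all the information we get.
        diff = loss(plus) - loss(minus)
        # Per-parameter slope estimate from that single difference.
        theta = [t - a_k * diff / (2.0 * c_k * d) for t, d in zip(theta, delta)]
    return theta


def toy_loss(params):
    # Hypothetical objective with its minimum at (3, -2).
    x, y = params
    return (x - 3.0) ** 2 + (y + 2.0) ** 2


result = spsa_minimize(toy_loss, [0.0, 0.0])
```

Nothing in that loop differentiates `toy_loss`, so you could swap it for anything you can only measure, such as the outcome of a batch of games between the two perturbed parameter sets.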