Really cool paper, lots of examples of what worked, lots of interesting ideas. Some things I got from a first read-through:
- sample selection during training - filtering out 0/8 and 8/8 problems has been done before, but it's interesting that they keep doing it while training: as the model learns to solve some problems, they shift from x/8 toward 8/8, and the paper removes them dynamically. Cool idea (rough sketch after this list).
- increasing temperature after an "entropy decrease" in the model - as the model "learns" new patterns, the entropy of its answers (measured via n-grams) decreases, so they dynamically raise the sampling temperature to encourage more diverse answers (see the second sketch below).
- RoPE gives you free gains.
- each model is different, and what works at one scale doesn't necessarily work at others - I think this was already "known", but it's cool to see it applies to RL as well.
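On the dynamic sample selection: here's a rough sketch of how I imagine that filtering working. The names (`pass_rates`, `has_learning_signal`, the dict-of-prompts shape) are mine, not the paper's, and the idea is just that prompts the model always solves or never solves carry no useful signal, so you drop them from the pool as their pass rate drifts.

```python
import random

def has_learning_signal(pass_rates, prompt_id, k=8):
    # A prompt only gives useful signal if its current pass rate is strictly
    # between 0/k and k/k; at the extremes every rollout looks the same.
    solved = pass_rates.get(prompt_id)
    if solved is None:
        return True  # never rolled out yet, keep it in the pool
    return 0 < solved < k

def sample_training_batch(prompts, pass_rates, batch_size, k=8):
    # Re-filter the pool every iteration: as the model improves, problems
    # drift from x/k toward k/k and drop out of the pool dynamically.
    eligible = [p for p in prompts if has_learning_signal(pass_rates, p["id"], k)]
    return random.sample(eligible, min(batch_size, len(eligible)))
```

You'd update `pass_rates` after each rollout batch with how many of the k samples were correct per prompt.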
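And for the entropy-triggered temperature bump, a minimal sketch of the mechanism as I understood it (the entropy measure, the bump size, and the function names are my guesses, the paper's exact recipe may differ):

```python
import math
from collections import Counter

def ngram_entropy(texts, n=3):
    # Shannon entropy over the n-gram distribution of a batch of sampled answers.
    counts = Counter()
    for t in texts:
        toks = t.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def adjust_temperature(temp, entropy, prev_entropy, bump=0.05, max_temp=1.2):
    # If answer entropy dropped since the last check, raise the sampling
    # temperature to push the model back toward more diverse outputs.
    if prev_entropy is not None and entropy < prev_entropy:
        temp = min(temp + bump, max_temp)
    return temp
```

So each check you compute the entropy of the latest batch of answers, compare to the previous value, and nudge the temperature up if it fell.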