for RL cost*
pretraining actually becomes more expensive as you make MoE models sparser (you need more pretraining tokens, and if you don't have them, you need to train for longer)