What 1T-parameter base model have you seen from any of those labs?
It's MoE; each expert tower can be branched from some smaller model.
That's not how MoE works. You need to train the FFN directly, or else the FFN gate would have no clue which expert to activate.
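To make the gating point concrete, here is a rough sketch of top-k routing in plain Python. Everything in it is hypothetical and for illustration only (`route`, `moe_forward`, the toy experts, the random gate weights); it is not any lab's actual implementation. The gate is just a learned linear layer that scores each expert per token, so with untrained (here: random) weights the expert choice is arbitrary.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def route(token, gate_weights, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts.

    gate_weights holds one row per expert; each row scores the token.
    """
    logits = [dot(row, token) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, gate_weights, experts, top_k=2):
    # Weighted sum of the outputs of only the selected experts.
    out = [0.0] * len(token)
    for idx, weight in route(token, gate_weights, top_k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Toy setup (assumed values): 4 experts, 4-dim tokens, untrained random gate.
rng = random.Random(0)
dim, n_experts = 4, 4
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
# Stand-in "experts": each one just scales its input by a different factor.
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(n_experts)]

token = [1.0, 0.0, -1.0, 0.5]
routing = route(token, gate_weights)
output = moe_forward(token, gate_weights, experts)
```

The routing here comes from random gate weights, so it says nothing about which expert is actually suited to the token. That is the gap training has to close.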