MTP models share internal state with the main model, and also refer to parameters in the model.
They are more like a single model that has two separate attention head mechanisms.
MTP models share internal state with the main model, and also refer to parameters in the model.
They are more like a single model that has two separate attention head mechanisms.