No, even MoE models need to fit into (V)RAM. MoE inference is faster because only a subset of the experts (the feed-forward sub-networks inside each MoE layer) is used to predict the next token, but which experts get selected changes with every token, so all of them have to stay loaded.
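To make the point concrete, here is a rough sketch in Python. It is a toy, not how any particular model is implemented: the router is replaced by a random pick (real routers are learned gating networks), and the function names and parameter counts are made up for illustration. It shows that one token only touches `top_k` experts, while a longer sequence ends up touching all of them.

```python
import random


def route(num_experts: int, top_k: int, rng: random.Random) -> list[int]:
    """Hypothetical stand-in for a learned router: pick top_k distinct experts."""
    return sorted(rng.sample(range(num_experts), top_k))


def experts_touched(num_tokens: int, num_experts: int, top_k: int, seed: int = 0) -> set[int]:
    """Return the set of experts used at least once while generating num_tokens."""
    rng = random.Random(seed)
    used: set[int] = set()
    for _ in range(num_tokens):
        # Each token gets its own routing decision.
        used.update(route(num_experts, top_k, rng))
    return used


def param_counts(shared: int, per_expert: int, num_experts: int, top_k: int) -> tuple[int, int]:
    """Return (total parameters to hold in memory, parameters active per token)."""
    total = shared + num_experts * per_expert   # drives (V)RAM requirements
    active = shared + top_k * per_expert        # drives per-token compute
    return total, active


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real model.
    total, active = param_counts(shared=10, per_expert=5, num_experts=8, top_k=2)
    print(f"total={total}, active per token={active}")
    print("experts used by 1 token:", len(experts_touched(1, 8, 2)))
    print("experts used by 1000 tokens:", len(experts_touched(1000, 8, 2)))
```

The memory footprint follows `total`, while the speed benefit follows `active`. That gap is why an MoE model can run about as fast as a much smaller dense model and still need the memory of a large one.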