MoE is from Google (Noam Shazeer)
MTP is from Meta
Another DeepSeek advance that the West is copying is DeepSeek Sparse Attention (DSA).
Mixture-of-Experts (MoE) was introduced in the 1990s [1, 2]; see also [3, 4]. The idea was that MoE scales up model capacity while adding only a small computational overhead. MoEs did not become viable for high-performance applications until sparse routing was integrated with modern deep networks, which large-scale distributed computation made possible. The breakthrough came with sparsely gated networks [5], which showed that a very large network can maintain its accuracy while activating only a small fraction of its parameters for each input, during both training and inference.
[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts. (1991)
[2] M. I. Jordan, R. A. Jacobs, Hierarchical mixtures of experts and the EM algorithm. (1993)
[3] L. Xu, M. Jordan, G. E. Hinton, An alternative model for mixtures of experts. (1994)
[4] S. Waterhouse, D. MacKay, A. Robinson, Bayesian methods for mixtures of experts. (1995)
[5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean, Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. (2017)
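To make "activating only a small fraction" concrete, here is a rough sketch of top-k gating in plain Python. It is only an illustration of the general idea, not the implementation from [5], which also adds noise to the router logits and uses auxiliary load-balancing losses. The names `top_k_gate`, `moe_forward` and `make_expert` and the toy scaling "experts" are made up for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits, k):
    """Pick the k largest router logits and softmax over just those.

    Returns {expert_index: gate_weight}; experts not in the dict are skipped.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def make_expert(scale):
    """Hypothetical toy expert: multiplies its input by a constant."""
    return lambda x: [scale * v for v in x]


def moe_forward(x, router, experts, k=2):
    """Route x to k experts and return the gate-weighted sum of their outputs.

    router holds one weight vector per expert (assumed linear router).
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)  # only the selected experts are ever evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates


random.seed(0)
dim, n_experts = 4, 8
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_expert(s) for s in range(1, n_experts + 1)]
out, gates = moe_forward([1.0, -0.5, 0.25, 2.0], router, experts, k=2)
print(f"evaluated {len(gates)} of {n_experts} experts")
```

With 8 experts and k=2, the layer holds all 8 experts' parameters but evaluates only 2 of them per input, which is the capacity versus compute trade-off described above.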
Yes - I meant as applied to LLMs/Transformers.