MoE works exactly the opposite way you described. In an MoE model, each inference pass reads only a subset of the parameters (the router activates a few experts per token), so with the same memory bandwidth you can run a much larger model and still hit the same tokens per second. The trade-off is that you use more memory overall, since all the experts have to stay resident even though only a few are read per token.
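A rough back-of-the-envelope sketch of the bandwidth argument (all numbers here are hypothetical, and this ignores KV cache traffic, batching, and compute overlap): decode throughput is roughly bounded by how fast you can stream the *active* parameters from memory once per token.

```python
# Crude decode-throughput model: generating one token requires reading
# every active parameter from memory once. Real systems are more
# complicated (KV cache, batching, overlap), so treat this as a bound.
def tokens_per_sec(active_params_b, bytes_per_param, bandwidth_gbs):
    """active_params_b: active parameters in billions."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 1000  # GB/s of memory bandwidth (hypothetical accelerator)

# Dense model: all 70B params are read every token (fp16 = 2 bytes).
dense = tokens_per_sec(70, 2, BW)

# MoE model: maybe 13B params active per token out of a much larger
# total -- same bandwidth, far more tokens/s, but the *full* parameter
# set still has to fit in memory.
moe = tokens_per_sec(13, 2, BW)

print(f"dense 70B active: {dense:.1f} tok/s")
print(f"MoE   13B active: {moe:.1f} tok/s")
```

So the MoE run is several times faster per token at the same bandwidth, which is exactly the "bigger model, same tokens per second" point: the cost shows up as memory capacity, not memory bandwidth.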