Shouldn't the mixture-of-experts (MoE) approach allow one to conserve memory by working on a specific problem type at a time?
> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.
Sort of, but the "experts" aren't divided along conceptually interpretable lines (there's no "math expert" or "coding expert"). Routing is learned per token, per layer, so the naive understanding of MoE is misleading.
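To make that concrete, here's a minimal sketch of a top-k routed MoE feed-forward layer (PyTorch, made-up sizes, not any particular model). The router picks a couple of experts for every token at every MoE layer from learned gate scores, which is why you can't just load "the expert for this problem type" up front:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k routed MoE FFN block; sizes are arbitrary."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # learned gate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # k experts chosen per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out
```

The routing decision depends on the token's hidden state at that layer, not on the topic of the prompt, so which experts fire is hard to predict ahead of time.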
What you typically end up with in memory-constrained environments is that the core shared layers sit in fast memory (VRAM, ideally) and the expert weights sit in slower memory (system RAM or even a fast SSD).
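Roughly, the split looks like this. A hedged sketch, assuming the model exposes each layer's experts as an `nn.ModuleList` attribute named `experts` (as in the snippet above); real offloading runtimes cache hot experts and overlap transfers with compute rather than shuttling weights this naively:

```python
import torch
import torch.nn as nn

def split_placement(model: nn.Module, expert_attr: str = "experts"):
    """Keep shared weights in VRAM; park the expert weights in system RAM."""
    model.to("cuda")                                   # attention, norms, routers, embeddings
    for module in model.modules():
        experts = getattr(module, expert_attr, None)   # assumed attribute name
        if isinstance(experts, nn.ModuleList):
            experts.to("cpu")                          # the bulk of the parameters

def run_offloaded_expert(expert: nn.Module, x_gpu: torch.Tensor) -> torch.Tensor:
    """Pull one expert's weights onto the GPU for the tokens routed to it, then park them again."""
    expert.to("cuda", non_blocking=True)
    y = expert(x_gpu)
    expert.to("cpu")
    return y
```

You trade PCIe (or SSD) transfer time for VRAM headroom, which is why the shared layers, touched by every token, are the part you want resident.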
MoE models are typically shallow-but-wide compared with dense models of similar total size, so they end up faster than an equivalent dense model: each token runs through fewer layers, and within each MoE layer it only activates a couple of experts rather than the full width.
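Back-of-the-envelope, with assumed numbers roughly in the shape of Mixtral-8x7B (about 47B parameters total, about 13B active per token), showing why the per-token compute is a small fraction of what has to sit in memory:

```python
# Illustrative numbers only; the exact split varies by model.
n_experts, top_k = 8, 2
expert_params = 5.6e9    # per-expert FFN weights (assumed)
shared_params = 2.0e9    # attention, embeddings, routers (assumed)

total_params  = shared_params + n_experts * expert_params   # what you must store
active_params = shared_params + top_k * expert_params       # what each token touches

print(f"total:  {total_params / 1e9:.0f}B")   # ~47B held in memory
print(f"active: {active_params / 1e9:.0f}B")  # ~13B of compute per token
```

So the memory bill is for the total parameter count, but the compute (and hence speed) is closer to that of a much smaller dense model.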