I think you are making my point. Having somewhat slower but much larger memory on the card would speed this use case up a lot. It would either remove the need to go to system memory or leave system memory for very rarely used experts, allowing even larger MoE models to run with good performance.

I think speeding up long context and opening up the use of models with larger shared layers is ultimately more relevant than hosting unused MoE layers. Of course you could do that as a last resort, i.e. when running with a smaller context that leaves some VRAM free to use.
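
For a rough sense of how much VRAM a smaller context frees up, here is a back-of-the-envelope sketch using the standard KV cache size formula for a transformer. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 cache) is a made-up example, not any specific model, and the function name is just for illustration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    The factor of 2 covers keys and values; bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

short_ctx = kv_cache_bytes(context_len=8 * 1024, **shape)
long_ctx = kv_cache_bytes(context_len=128 * 1024, **shape)

print(f"8k context:   {short_ctx / GIB:.1f} GiB")
print(f"128k context: {long_ctx / GIB:.1f} GiB")
print(f"freed by running at 8k: {(long_ctx - short_ctx) / GIB:.1f} GiB")
```

With these assumed numbers the cache takes 1 GiB at 8k tokens and 16 GiB at 128k, so running at the smaller context leaves roughly 15 GiB that could hold MoE layers instead.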

Long context will be solved, capped, and turned into a Θ(1) operation or, at worst, Θ(log n). People don't have infinite perfect recall, so agents don't need it either. There are also really good solutions that just aren't explored enough right now, since transformer architectures are where everyone is dumping money and time. I suspect that very soon someone will have a much better system that just takes over, and then the idea of context limits will be a thing of the past. I've actually built something myself that allows infinite context with perfect recall in Θ(1) (minor asterisk here, as there has to be, but meh). I know others have solutions too.