Yeah, I’ve often wondered why folks aren’t training two-tier MoEs for VRAM + RAM. We already have designs for shared experts, so it can’t be hard to implement a router that allocates 10x or 100x as often to “core” experts vs the “nice to have” experts. I suppose balancing during training is tricky, but some sort of custom loss on the router layers should work.
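To make the "custom loss" idea concrete, here is a minimal sketch of one way it could look: a balancing term that pulls the average router distribution toward a deliberately non-uniform target instead of the usual uniform one. The function name, `core_ids`, and `core_weight` are all hypothetical, and plain Python is used only for readability; a real version would be written in a differentiable framework and added to the training loss.

```python
import math

def two_tier_balance_loss(router_probs, core_ids, core_weight=10.0):
    """KL(target || observed load) with a non-uniform target load.

    router_probs: list of per-token probability lists over experts.
    core_ids:     set of expert indices meant to stay in VRAM (assumption).
    core_weight:  how many times more often a core expert should be picked
                  than a long-tail one (the "10x" in the comment above).
    """
    n_experts = len(router_probs[0])
    n_tokens = len(router_probs)

    # Target load: each core expert gets `core_weight` shares, the rest get 1.
    shares = [core_weight if e in core_ids else 1.0 for e in range(n_experts)]
    total = sum(shares)
    target = [s / total for s in shares]

    # Observed load: mean router probability per expert over the batch.
    load = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]

    eps = 1e-9  # avoids log(0) when an expert receives no probability mass
    return sum(t * math.log(t / (l + eps)) for t, l in zip(target, load))


# Routing that already matches the 10:1:1:1 target scores ~0;
# perfectly uniform routing is penalised.
matched = [[10 / 13, 1 / 13, 1 / 13, 1 / 13]] * 8
uniform = [[0.25, 0.25, 0.25, 0.25]] * 8
loss_matched = two_tier_balance_loss(matched, core_ids={0})
loss_uniform = two_tier_balance_loss(uniform, core_ids={0})
```

Whether a term like this trains stably alongside the main objective is exactly the "balancing is tricky" part; the sketch only shows the shape of the target.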

I’ve also wondered why the routers aren’t trained to be serially consistent, so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth.

I think part of the issue is that in production deployments, you're batching high enough that you'll be paging in those long tail experts constantly.

Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting for host memory, which will kill your throughput.
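A back-of-the-envelope way to see the batching problem, assuming (simplistically) that each token picks a given long-tail expert independently with the same probability. The function name and the example numbers are illustrative, not measurements from any deployment.

```python
def p_expert_needed(p_token, batch_tokens):
    """Probability that at least one token in the batch routes to the expert,
    assuming independent routing with per-token probability p_token."""
    return 1.0 - (1.0 - p_token) ** batch_tokens


# An expert that any single token picks 1% of the time:
single = p_expert_needed(0.01, 1)      # 1% for unbatched inference
small = p_expert_needed(0.01, 32)      # a bit over a quarter of steps
large = p_expert_needed(0.01, 512)     # over 99% of steps
```

Under that assumption, a "rare" expert is needed on nearly every step once the batch is a few hundred tokens wide, so it effectively has to be resident.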

It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.

Ideally, you should rearrange batches so that inference steps that rely on the same experts get batched together, then inferences that would "hold up" a batch simply wait for that one "long tail" expert to be loaded, whereupon they can progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
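A rough sketch of that scheduling idea: split pending requests into those whose experts are all resident (run now) and those parked behind a specific missing expert. Everything here is hypothetical, including the assumption that each request's expert set is known before the step runs; in practice it is only known once the router for that layer has executed, which is where the partial-step checkpointing would come in.

```python
from collections import defaultdict

def schedule_step(pending, resident):
    """Partition requests by whether their experts are already in VRAM.

    pending:  dict of request id -> set of expert ids needed next (assumed known).
    resident: set of expert ids currently loaded in VRAM.
    Returns (run_now, waiting), where waiting maps a missing expert id to the
    request ids parked until that expert has been loaded.
    """
    run_now = []
    waiting = defaultdict(list)
    for request_id, experts in pending.items():
        missing = experts - resident
        if not missing:
            run_now.append(request_id)
        else:
            for expert_id in sorted(missing):
                waiting[expert_id].append(request_id)
    return run_now, dict(waiting)


pending = {"a": {0, 1}, "b": {0, 7}, "c": {1}, "d": {7}}
run_now, waiting = schedule_step(pending, resident={0, 1, 2})
# "a" and "c" proceed; "b" and "d" wait on expert 7 and can share one load.
```

Requests parked on the same expert can then be released together, so a single transfer is shared across all of them.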

I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.

But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM which you amortize over a single token, instead of reads from HBM which you amortize over 32 tokens.
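The amortization argument can be written out as a tiny calculation. The expert size and both bandwidth figures below are placeholders chosen to make the arithmetic easy, not specs for any real hardware; only the 1-token vs 32-token split comes from the comment above.

```python
def load_time_per_token(expert_bytes, bandwidth_bytes_per_s, tokens_sharing):
    """Seconds of weight-loading time attributed to each token that uses
    the expert, when one load is shared by `tokens_sharing` tokens."""
    return expert_bytes / bandwidth_bytes_per_s / tokens_sharing


EXPERT_BYTES = 100e6   # placeholder expert size
HOST_LINK_BW = 50e9    # placeholder host-to-GPU bandwidth, bytes/s
HBM_BW = 2000e9        # placeholder HBM bandwidth, bytes/s

cold = load_time_per_token(EXPERT_BYTES, HOST_LINK_BW, tokens_sharing=1)
hot = load_time_per_token(EXPERT_BYTES, HBM_BW, tokens_sharing=32)
slowdown = cold / hot  # bandwidth ratio (40) times token ratio (32)
```

With these placeholder numbers the per-token cost differs by 40 × 32 = 1280x, which is the point: the slower link and the lost amortization multiply rather than add.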

I think your analysis is right: this would make sense mostly for the 30B-3A style models that are mostly for edge / hobbyist use, where context length is precious so nobody is batching.

Given that experts live per layer, I don’t think it makes sense to have orbital mechanics experts, but … I have wondered about swapping out the bottom 10% of layers per topic, given that that is likely where the highest-order concepts live. I’ve always wondered why people bother with LoRA on all layers, given that the early layers are more likely to be topic-agnostic and focused on more basic pattern assembly (see the recent papers on how LLMs count on a manifold).

Maybe I am misunderstanding something but:

1) This is basically the intention of several recent MoE models: keep particular generally useful experts hot in VRAM.

2) Unless you can swap layers in faster than you consume them there is no point to predicting layers (what does this even really mean? did you mean predicting experts?).

It seems at the moment the best you can do is keep experts and layers more likely to be used for a given query in VRAM and offload the rest, but this is work-dependent.

I don't have links handy but there is active research in this area.