You can do that, but you're going to have rather low throughput unless you have lots of PCIe lanes to attach storage to. That pushes you toward either an HEDT (high-end desktop) platform or some kind of compute cluster, since consumer desktops only expose roughly 20-28 CPU lanes.
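To put rough numbers on it, here's a back-of-envelope sketch. All the figures are my own illustrative assumptions, not from the thread: ~7 GB/s sustained reads per PCIe 4.0 x4 NVMe drive, and a model that has to stream ~37 GB of active weights per token with no caching (worst case):

    # Back-of-envelope tokens/s when streaming active weights from NVMe.
    # Assumptions (illustrative): each PCIe 4.0 x4 drive sustains ~7 GB/s,
    # and ~37 GB of weights must be read per token (no hot experts in RAM).
    def tokens_per_sec(n_drives, gb_s_per_drive=7.0, gb_per_token=37.0):
        return n_drives * gb_s_per_drive / gb_per_token

    for n in (1, 4, 24):  # 24 x4 drives = 96 lanes, HEDT/server territory
        print(f"{n:2d} drives ({n * 4:3d} lanes): {tokens_per_sec(n):.2f} tok/s")

Under those assumptions one drive gets you ~0.2 tok/s and you need on the order of 96 lanes of NVMe just to reach ~4.5 tok/s, which is why the lane count is the bottleneck.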

Batching inference requests doesn't necessarily help that much, since as models get sparser the individual requests in a batch share fewer experts, so each expert you load off storage gets reused less. It does always help with respect to the shared routing layers, of course.
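To make that concrete, here's a toy model under the (oversimplified) assumption of uniform, independent routing: with E experts and k active per token, a batch of B tokens is expected to touch E * (1 - (1 - k/E)^B) distinct experts per MoE layer. All numbers below are made up to show the trend; real routers are skewed, which would increase reuse somewhat:

    # Toy estimate of how many distinct experts a batch hits per MoE layer,
    # assuming uniform independent routing (a pessimistic sketch, since real
    # routing distributions are skewed toward popular experts).
    def expected_distinct_experts(E, k, B):
        return E * (1 - (1 - k / E) ** B)

    for E, k in ((8, 2), (256, 8)):      # dense-ish vs. very sparse routing
        for B in (1, 8, 64):
            d = expected_distinct_experts(E, k, B)
            print(f"E={E:3d} k={k} B={B:2d}: ~{d:5.1f} distinct experts, "
                  f"reuse ~{B * k / d:.1f}x per expert read")

With 8 experts and 2 active, a batch of 64 reuses each loaded expert ~16x; with 256 experts and 8 active, the same batch only gets ~2x reuse. The sparser the model, the larger the batch has to be before streaming weights starts to amortize.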