Hacker News

zozbot234 3 months ago

There's plenty of scope for local AI models to become more efficient, too. MoE doesn't need too much RAM: only the parameters for the experts that are active at any given time truly need to be in memory; the rest can sit in read-only storage and be fetched on demand. If you're doing CPU inference this can even be managed automatically by mmap, whereas loading params into VRAM currently has to be managed explicitly as part of running an inference step. (This is where GPU drivers/shader languages/programming models could also see some improvement, TBH)
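A minimal sketch of the mmap idea, assuming a hypothetical flat file where each expert's float32 weights sit at a fixed offset. The file layout, `NUM_EXPERTS`, `FLOATS_PER_EXPERT` and the helper names are made up for illustration; this is not how llama.cpp or any real runtime stores weights. It only shows that reading two experts touches two regions of the mapping and leaves the rest to the OS page cache.

```python
import mmap
import os
import struct
import tempfile

NUM_EXPERTS = 8            # hypothetical expert count
FLOATS_PER_EXPERT = 1024   # hypothetical (tiny) expert size
BYTES_PER_EXPERT = FLOATS_PER_EXPERT * 4  # float32 = 4 bytes

def write_dummy_weights(path):
    """Write a fake weights file: expert i is filled with the value float(i)."""
    with open(path, "wb") as f:
        for i in range(NUM_EXPERTS):
            values = [float(i)] * FLOATS_PER_EXPERT
            f.write(struct.pack(f"<{FLOATS_PER_EXPERT}f", *values))

def read_expert(mm, expert_id):
    """Copy one expert's bytes out of the mapping and decode them.

    Only the file regions that are actually touched need to be paged in
    (at page granularity, plus whatever readahead the OS decides to do).
    """
    start = expert_id * BYTES_PER_EXPERT
    raw = mm[start:start + BYTES_PER_EXPERT]
    return struct.unpack(f"<{FLOATS_PER_EXPERT}f", raw)

def demo(active_experts=(2, 5)):
    """Map the whole file read-only, then read only the 'active' experts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "experts.bin")
        write_dummy_weights(path)
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # First weight of each active expert, as a quick sanity check.
                return {e: read_expert(mm, e)[0] for e in active_experts}
            finally:
                mm.close()
```

Calling `demo()` returns the first weight of experts 2 and 5 without the program ever explicitly loading the other six. Whether that is fast enough in practice depends on how often the active set changes.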

dummydummy1234 3 months ago

But aren't the experts chosen on a token-by-token basis, which means you run into bandwidth limitations?
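To make the concern concrete, here is a toy sketch of per-token top-k routing. The gate scores are invented, and it assumes (pessimistically) that only the previous token's experts are still resident in fast memory. Real routers and caches behave differently; this only illustrates why the set of experts to fetch can change on every token.

```python
def top_k_experts(scores, k):
    """Return the indices of the k highest gate scores (ties go to the lower index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])

def experts_to_fetch(per_token_scores, k):
    """For each token, list the experts that were not active for the previous token."""
    resident = set()
    fetched = []
    for scores in per_token_scores:
        active = set(top_k_experts(scores, k))
        fetched.append(sorted(active - resident))
        # Assumption: only the current token's experts stay in fast memory.
        resident = active
    return fetched

# Made-up gate scores for 3 tokens over 4 experts.
SCORES = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.30, 0.10, 0.10],
    [0.05, 0.05, 0.50, 0.40],
]
```

With `k=2`, `experts_to_fetch(SCORES, 2)` shows at least one new expert being pulled in for every token, and each such fetch costs bandwidth from wherever the weights live.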

refulgentis 3 months ago

Yes, and the direct conclusion from that, tl;dr: in theory OP's explanation could reduce RAM usage; in practice, it's worse.

(Source: I maintain an app integrated with llama.cpp. In practice no one likes the 1 tok/s generation times you get from swapping, and honestly MoE makes the RAM situation worse, because model developers have servers, batch inference, and multiple GPUs wired together. They are more than happy to increase the resting RAM budget and use even more parameters. From that lens, limiting the active experts is about inference speed, not anything else.)

imtringued 3 months ago

MoE works exactly the opposite of the way you described. MoE means that each inference pass reads only a subset of the parameters, which means that you can run a bigger model with the same memory bandwidth and still get the same number of tokens per second. So you end up using more memory, not less.
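A back-of-the-envelope version of that argument, assuming decode speed is bounded by memory bandwidth and that every token reads its active weights exactly once. The bandwidth and model sizes below are invented numbers, not measurements of any real model.

```python
def tokens_per_second(bandwidth_gb_s, active_gb_per_token):
    """Rough upper bound: bandwidth divided by the bytes read per token."""
    return bandwidth_gb_s / active_gb_per_token

BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth

# Hypothetical models: same active weights per token, different total size.
DENSE = {"total_gb": 10.0, "active_gb": 10.0}
MOE = {"total_gb": 40.0, "active_gb": 10.0}

def summarize(model):
    """Return (estimated tokens/s, memory needed to keep all weights resident)."""
    speed = tokens_per_second(BANDWIDTH_GB_S, model["active_gb"])
    return speed, model["total_gb"]
```

Under these assumptions `summarize(DENSE)` and `summarize(MOE)` give the same tokens/s, but the MoE model needs four times the resident memory to get there.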