Absolutely, it won't work as well for MoE, though today most local models (except Qwen3-30B-A3B) are dense.

But even for MoE it still works: sure, a second parallel agent will cut each agent's token rate almost in half, but the marginal slowdown shrinks roughly geometrically with each added agent, and by the 30th it's almost free. Once the batch is large enough, every expert gets hit on every step anyway, so the model reads weights like a dense one. So if you have enough VRAM to run Qwen3-32B, you can run Qwen3-30B-A3B with each instance generating about as fast as the 32B would, except you'll be running a hundred instances.
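A rough back-of-the-envelope model for why, assuming decoding is memory-bandwidth-bound (step time proportional to the distinct expert weights read) and that tokens route to experts independently and uniformly. The 128-experts / 8-active figures match Qwen3-30B-A3B's routing; ignoring attention and shared weights is a simplification for illustration:

```python
# Toy model: per-agent and aggregate decode speed for a batched MoE.
# Assumptions (not a benchmark): memory-bandwidth-bound decoding, so a
# step's cost ~ number of distinct experts whose weights must be read;
# independent uniform routing of tokens to experts.
E, K = 128, 8  # experts per MoE layer, experts active per token (Qwen3-30B-A3B)

def distinct_experts(batch: int) -> float:
    """Expected distinct experts touched by `batch` tokens in one step."""
    return E * (1 - (1 - K / E) ** batch)

for b in (1, 2, 4, 8, 32, 100):
    step_cost = distinct_experts(b)   # relative step time
    per_agent = K / step_cost         # each agent's token rate vs running solo
    aggregate = b * per_agent         # total tokens/s vs a single solo agent
    print(f"{b:3d} agents: per-agent speed {per_agent:.2f}x solo, "
          f"aggregate {aggregate:.1f}x")
```

With 2 agents each one runs at ~0.52x its solo speed (the "almost divided by two"), but by 100 agents per-agent speed has flattened out at K/E = 0.0625x, which is exactly the speed of a hypothetical dense model reading all the experts every step, while aggregate throughput keeps climbing.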