For LLM inference, I don't think PCIe bandwidth matters much, and a GPU can greatly improve prompt processing speed.

The Strix Halo iGPU is quite special: like the Apple iGPUs, it has such good bandwidth to system RAM that it manages to improve both prompt processing and token generation compared to pure CPU inference. You really can't say that about the average iGPU or low-end dGPU: their memory bandwidth is usually far too anemic, so the CPU wins when it comes to emitting tokens.

Only if your entire model fits in the GPU's VRAM.

To me this reads like "if you can afford those 256GB VRAM GPUs, you don't need PCIe bandwidth!"

No, that's not true. Prompt processing only needs the attention tensors in VRAM; the MLP weights aren't involved in the heavy calculations that a GPU speeds up. (After attention, you only need to pass the activations from the GPU back to system RAM, which is only about 40KB per token, so you're not very limited there.)
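Rough back-of-envelope for that activation figure, assuming the per-token transfer is about hidden_size values (DeepSeek R1's model dimension is 7168) at fp32; the exact number depends on precision and whatever extra tensors ride along, so treat this as a sketch, not llama.cpp's actual transfer size:

```python
# Per-token activation traffic across the PCIe bus, assuming one
# hidden_size-wide vector at fp32. hidden_size=7168 is DeepSeek R1's
# model dimension; the precision is an assumption.
hidden_size = 7168
bytes_per_value = 4  # fp32 (assumed)
per_token = hidden_size * bytes_per_value
print(f"{per_token / 1024:.0f} KiB per token")  # ~28 KiB, same ballpark as ~40KB
```

Either way it's tens of kilobytes per token, which even a couple of PCIe lanes handle trivially.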

That's pretty small.

Even DeepSeek R1 0528 685B only has about 16GB of attention weights. Kimi K2, at 1T parameters, has 6,168,951,472 attention params, which works out to ~12GB at 2 bytes each.
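Quick sanity check on that Kimi K2 number, assuming the attention weights are stored at 2 bytes per parameter (bf16/fp16):

```python
# Attention-weight footprint for Kimi K2: parameter count from the
# comment above, 2 bytes/param (bf16/fp16) is the assumption.
attn_params = 6_168_951_472
total_bytes = attn_params * 2
print(f"{total_bytes / 1e9:.1f} GB")  # 12.3 GB
```

A quantized cache or lower-precision attention would shrink it further, but even unquantized it fits a 24GB card with room to spare.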

It's pretty easy to do prompt processing for massive models like DeepSeek R1, Kimi K2, or Qwen 3 235B with a single Nvidia 3090. Just pass --n-cpu-moe 99 to llama.cpp, or something similar.
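For concreteness, an invocation along these lines (model filename and context size are placeholders; the flags are real llama.cpp options, but check your build's --help for exact behavior):

```shell
# -ngl 99 offloads all layers to the GPU, then --n-cpu-moe 99 keeps the
# MoE expert weights of those layers in system RAM, so only the attention
# (and other dense) tensors land in the 3090's 24GB of VRAM.
./llama-server -m DeepSeek-R1-0528-Q4_K_M.gguf -ngl 99 --n-cpu-moe 99 -c 8192
```

Token generation still runs at system-RAM speed for the expert weights, but prompt processing gets the full GPU treatment.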

If you can't, though, your performance will likely be abysmal, so there's almost no middle ground for this workload.

Yeah, I think so. Once the whole model is on the GPU (at the cost of a slower start-up), there really isn't much traffic between the GPU and the rest of the system. That's how I think about it, but I'm mostly saying this because I'm interested in being corrected if I'm wrong.