> There's at least a little flexibility with the graphics card if you move the board into a different case—there's a single PCIe x4 slot on the board that you could put an external GPU into, though many PCIe x16 graphics cards will be bandwidth starved.

https://arstechnica.com/gadgets/2025/08/review-framework-des...

There are no situations where this matters yet. You have to drop down to a PCIe 3.0 x8 slot to even begin to see any meaningful impact on benchmarks (synthetic or otherwise).
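For scale, a back-of-envelope (assuming the Framework slot is PCIe 4.0 x4, which is what Strix Halo provides): that PCIe 3.0 x8 threshold works out to roughly the same bandwidth as the slot in question.

    PCIe 3.0: ~0.985 GB/s per lane -> x8 ≈ 7.9 GB/s
    PCIe 4.0: ~1.97  GB/s per lane -> x4 ≈ 7.9 GB/s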

For LLM inference, I don't think PCIe bandwidth matters much, and a GPU could greatly improve prompt processing speed.

The Strix Halo iGPU is quite special: like the Apple iGPUs, it has such good memory bandwidth to system RAM that it manages to improve both prompt processing and token generation compared to pure CPU inference. You really can't say that about the average iGPU or low-end dGPU: their memory bandwidth is usually way too anemic, so the CPU wins when it comes to emitting tokens.
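Rough numbers to put that in perspective (approximate peak figures; the Strix Halo one comes from its published 256-bit LPDDR5X-8000 configuration):

    Strix Halo:  256 bit × 8000 MT/s ÷ 8 ≈ 256 GB/s
    Desktop CPU: 128 bit × 5600 MT/s ÷ 8 ≈  90 GB/s  (dual-channel DDR5-5600)

Token generation is memory-bandwidth bound, so an iGPU reading from the same ~90 GB/s pool as the CPU gains nothing there; Strix Halo's much wider bus is what changes the picture.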

Only if your entire model fits in the GPU's VRAM.

To me this reads like "if you can afford those 256GB VRAM GPUs, you don't need PCIe bandwidth!"

No, that's not true. Prompt processing only needs the attention tensors in VRAM; the MLP weights aren't needed for the heavy calculations that a GPU speeds up. (After attention, you only need to pass the activations from GPU to system RAM, which is roughly 40 KB, so you're not very limited there.)
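Back-of-envelope on that transfer size (assuming DeepSeek R1's hidden dimension of 7168; the exact figure depends on precision and per-layer bookkeeping, hence the ~40 KB above):

    7168 values × 2 bytes (fp16) ≈ 14 KB per token per layer crossing
    7168 values × 4 bytes (fp32) ≈ 28 KB

Either way it's kilobytes going over a link that moves gigabytes per second.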

That's pretty small.

Even DeepSeek R1 0528 (685B) only has something like ~16 GB of attention weights. Kimi K2, with 1T parameters, has 6,168,951,472 attention params, which means ~12 GB.
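That ~12 GB follows directly if you assume 16-bit weights, and the R1 figure implies a similar attention parameter count:

    Kimi K2:     6,168,951,472 params × 2 bytes ≈ 12.3 GB
    DeepSeek R1: ~16 GB ÷ 2 bytes ≈ ~8B attention params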

It's pretty easy to do prompt processing for massive models like DeepSeek R1, Kimi K2, or Qwen3 235B with only a single Nvidia RTX 3090 GPU. Just pass --n-cpu-moe 99 to llama.cpp or something similar.
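A minimal sketch of that setup with llama.cpp's server (the model file name is illustrative, not a real download):

    # Offload all layers to the GPU (-ngl 99), but keep the MoE expert
    # weights in system RAM (--n-cpu-moe 99). Attention runs in VRAM,
    # so prompt processing still gets the GPU speed-up.
    llama-server -m DeepSeek-R1-0528-Q4_K_M.gguf -ngl 99 --n-cpu-moe 99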

If you can't, though, your performance will likely be abysmal, so there's almost no middle ground for the LLM workload.

Yeah, I think so. Once the whole model is on the GPU (at the cost of a potentially slower start-up), there really isn't much traffic between the GPU and the motherboard. That's how I think about it, anyway. But I'm mostly saying this because I'm interested in being corrected if I'm wrong.
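Rough numbers under that assumption (a PCIe 4.0 x4 link at ~7.9 GB/s and, hypothetically, 16 GB of quantized weights):

    16 GB ÷ 7.9 GB/s ≈ 2 s of one-time load traffic

After that, only prompt tokens in and generated tokens out cross the link, which is kilobytes per request.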

You can also use an adapter to repurpose an M.2 slot as PCIe x16, but the bandwidth is still only x4.

That's just called PCIe x4 [1]. Each PCIe lane is an independent channel; the wider slot will simply have disconnected pins. You can actually do this with regular motherboard PCIe x4 slots by cutting the plastic at the end of the slot so you can insert a wider card, and most cards work just fine.

[1]: It sounds like a nitpick, but a PCIe x16 slot with x4 effective bandwidth can exist and is a different thing: the actual PCIe interface is x16, but there is an upstream bottleneck (e.g., the aggregate bandwidth from the chipset to the CPU is not enough to handle all peripherals at once at full rate).