This problem was already solved 10 years ago: crypto-mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'd probably want a different setup, with more memory and some M.2 storage.
Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
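To put rough numbers on that, here's a back-of-envelope sketch. The lane throughput and model size are illustrative assumptions, not figures from the thread:

```python
# Back-of-envelope: what one PCIe lane per GPU costs you.
# Assumed numbers: ~1 GB/s usable per PCIe 3.0 lane per direction
# (after 128b/130b encoding overhead), and a hypothetical ~24 GB
# model shard to load onto each GPU.
PCIE3_LANE_GBPS = 0.985   # approx. usable GB/s per PCIe 3.0 lane

def load_seconds(size_gb, lanes, lane_gbps=PCIE3_LANE_GBPS):
    """Time to push `size_gb` of weights over `lanes` PCIe 3.0 lanes."""
    return size_gb / (lanes * lane_gbps)

model_gb = 24.0
print(f"x1:  {load_seconds(model_gb, 1):.1f} s")   # mining-board slot
print(f"x16: {load_seconds(model_gb, 16):.1f} s")  # full-size slot
```

So a one-time weight load is tolerable even over x1 (tens of seconds vs a couple of seconds); the question is whether anything has to cross the bus repeatedly after that.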
After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.
Even in tensor-parallel modes? I thought that only works if you're fine with stalling all but n GPUs for n users at any given moment.
In theory, a single lane is only sufficient for pipeline parallelism, given the limited lane count and interconnect bandwidth.
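A rough sketch of why the two modes differ so much in bus traffic. The model shape is a made-up example (loosely 70B-class) and the all-reduce cost uses the standard ring estimate; none of these numbers come from the thread:

```python
# Per-token interconnect traffic during decode, fp16 activations.
# Assumed (hypothetical) model shape:
hidden = 8192     # hidden dimension
layers = 80       # transformer layers
bytes_per = 2     # fp16
gpus = 4

# Pipeline parallel: the hidden state crosses each stage boundary
# exactly once per token.
pp_bytes = hidden * bytes_per * (gpus - 1)

# Tensor parallel: two all-reduces per layer (after attention and
# after the MLP); a ring all-reduce moves ~2*(n-1)/n of the data
# per GPU per collective.
tp_bytes = layers * 2 * hidden * bytes_per * 2 * (gpus - 1) / gpus

print(f"pipeline: {pp_bytes / 1024:.0f} KiB/token")
print(f"tensor:   {tp_bytes / 1024 / 1024:.1f} MiB/token")
```

Under these assumptions, pipeline parallel moves ~48 KiB per token while tensor parallel moves a few MiB, and those MiB are latency-sensitive collectives in the middle of every layer. At ~1 GB/s per x1 lane that bounds tensor-parallel throughput to a few hundred tokens/s total, which is why a single lane is generally considered fine for pipeline parallelism but not for tensor parallelism.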
Generally, scalability on consumer GPUs falls off somewhere between 4 and 8 GPUs for most workloads. Those running more GPUs are typically using a larger number of smaller GPUs for cost-effectiveness.
M.2 is mostly just a different form factor for PCIe anyway.