Hacker News

Given that 32 GB/s is significantly worse than CPU-to-RAM speeds these days, does the additional compute really make it any faster in practice? The KV cache is always on the GPU anyway unless you're doing something really weird, so it won't affect ingestion, and generation is typically bandwidth-bound. With something like PCIe 6.0 x16 it would actually make sense, but nothing less than that. Maybe it would work for smaller dense models, which are more compute-bound, with PCIe 6.0 x8 or PCIe 5.0 x16, but that's already below DDR5 speeds.
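
A rough back-of-the-envelope sketch of the bandwidth argument above. The link figures are nominal, rounded per-direction numbers, and the 40 GB of weights streamed per token is a made-up example rather than anything measured:

```python
# Nominal per-direction bandwidth in GB/s (rounded; real-world throughput is lower).
LINK_GBPS = {
    "PCIe 4.0 x16": 32.0,
    "PCIe 5.0 x16": 64.0,
    "PCIe 6.0 x8": 64.0,
    "PCIe 6.0 x16": 128.0,
    "DDR5-5600 dual-channel": 89.6,  # 5600 MT/s * 8 bytes * 2 channels
}


def max_decode_tokens_per_s(streamed_gb: float, link_gbps: float) -> float:
    """Upper bound on decode speed if `streamed_gb` must cross the link for every token."""
    return link_gbps / streamed_gb


# Hypothetical workload: 40 GB of dense weights touched once per generated token.
STREAMED_GB = 40.0

for name, gbps in LINK_GBPS.items():
    bound = max_decode_tokens_per_s(STREAMED_GB, gbps)
    print(f"{name:>24}: at most {bound:.2f} tok/s")
```

This only bounds the bandwidth-limited case; it says nothing about prefill, where compute dominates.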

zozbot234 12 hours ago

Additional compute is generally a win for prefill, while memory bandwidth is king for decode. The KV cache, however, is the main blocker for long context, so it should be offloaded to system RAM and even to NVMe swap as context grows. Yes, that's slow on an absolute basis, but it's faster (and more power-efficient, which makes everything else faster) than not having the cache at all, so it's still a huge win.
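
To put numbers on how fast the KV cache grows with context, here is a minimal sketch using the usual size estimate of one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example, not a specific model from this thread:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2: one key and one value vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


# Hypothetical model shape, fp16 cache entries (2 bytes each).
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx, **HYPOTHETICAL) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

The size is linear in context length, so whether it fits in VRAM, system RAM, or only on NVMe is mostly a question of how long the context gets.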

moffkalast 3 hours ago

Well, if you do that, then you reverse the strengths of your system. It might be best to work with the context length you can offload, like a normal person.