Hacker News

The problem with wide parallelism with most models is that it blows up your KV cache. There's open models with KV caches lean enough to parallelize inference or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.