Hacker News

mixermachine 13 hours ago

With a parallelism of 16, you can still get around 25 to 30 tokens per second per user when all 16 channels are running. Not everyone will use the model at the same time, but it will certainly be tight, especially for agentic coding. For pure chat applications this should be quite fine.

zozbot234 12 hours ago

The problem with wide parallelism for most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference, or even to offload the KV cache itself to disk without immediately running into wear-out concerns, but they're quite exceptional.
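
For a rough sense of scale, here is a back-of-the-envelope sketch of how the KV cache grows with the number of parallel sequences. It assumes plain multi-head or grouped-query attention, an unquantized fp16 cache, no prefix sharing between sequences, and every sequence filling its whole context. The `AttentionConfig` values and the 32768-token context are hypothetical and not taken from any particular model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical model shape; none of these numbers come from the thread."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes an fp16/bf16 cache, no quantization


def kv_bytes_per_token(cfg: AttentionConfig) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_value


def kv_cache_bytes(cfg: AttentionConfig, context_tokens: int,
                   parallel_sequences: int) -> int:
    # Assumes each sequence keeps its own full cache (no shared prefixes).
    return kv_bytes_per_token(cfg) * context_tokens * parallel_sequences


# Illustrative configuration only.
EXAMPLE = AttentionConfig(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for parallel in (1, 4, 16):
        gib = kv_cache_bytes(EXAMPLE, 32768, parallel) / 2**30
        print(f"{parallel:>2} parallel sequences: {gib:.0f} GiB of KV cache")
```

Under these assumptions the cache scales linearly with parallelism: 128 KiB per token, 4 GiB per full-context sequence, and 64 GiB at a parallelism of 16. Models with leaner caches shrink the per-token term, which is what makes wide parallelism or disk offload more practical.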