A full KV cache can be quite big compared to the model weights (depending on context size), so that should be a factor too, and as far as I know you need to maintain a separate KV cache for each in-flight request. Also, tokens/s is not uniform across a request: generation gets slower as the sequence grows, since each new token has to attend over everything generated before it. A rough back-of-envelope is sketched below.
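
Here's a minimal sketch of that back-of-envelope, assuming roughly 7B-class numbers (32 layers, 32 KV heads, head_dim 128, fp16, no GQA); the exact figures are assumptions, not tied to any specific model:

```python
# Rough KV-cache size estimate vs. model weights.
# Layer/head counts and dtype are assumptions (roughly 7B-class, fp16, no GQA).

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for K and V, one entry per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

n_layers, n_kv_heads, head_dim = 32, 32, 128
weights_gb = 7e9 * 2 / 1e9  # ~14 GB of fp16 weights for a 7B model

for seq_len in (2048, 4096, 32768):
    per_request_gb = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len) / 1e9
    print(f"seq_len={seq_len:>6}: ~{per_request_gb:.2f} GB KV cache per request "
          f"(weights ~{weights_gb:.0f} GB)")
```

With those assumed numbers it's about 0.5 MB per token per request, so a few long-context requests in flight can rival the weights themselves (GQA models with fewer KV heads shrink this a lot).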

On the other side, speculative decoding is an insane booster: a draft model proposes a few tokens and the target model verifies them in a single forward pass, which pushes decode throughput part of the way toward prefill rates. But the memory pressure is still a factor.
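
A toy estimate of that effect, under the simplifying assumption that each drafted token is accepted independently with probability alpha (the acceptance rate and relative draft cost here are made up, not measurements):

```python
# Toy speedup estimate for speculative decoding: a draft model proposes k tokens,
# the target model verifies them in one forward pass.

def expected_tokens_per_target_pass(alpha, k):
    # Each drafted token is accepted independently with probability alpha;
    # even a full rejection still yields one corrected token from the target,
    # so the expectation is sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

alpha = 0.8   # assumed per-token acceptance rate
c = 0.1       # assumed cost of one draft step relative to one target step
for k in (2, 4, 8):
    tokens = expected_tokens_per_target_pass(alpha, k)
    # One iteration costs k draft steps plus one target verification pass
    speedup = tokens / (k * c + 1)
    print(f"k={k}: ~{tokens:.2f} tokens per target pass, ~{speedup:.2f}x speedup")
```

Note this only models compute per token, not the extra memory for the draft model or the larger KV reads, which is why the memory pressure doesn't go away.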

I would be happy to be corrected regarding both factors.