Not during prefill, i.e. the forward pass that produces the very first token of a new conversation. In that pass, all tokens in the context are processed at the same time and attention's K and V are cached; you still generate only a single token, but you have to compute attention from all tokens to all tokens, which is O(N^2).
From that point on, every subsequent token is processed sequentially in an autoregressive way, but because we have the KV cache, each step is O(N) (one query token attending to all cached tokens) instead of O(N^2).
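To make the asymmetry concrete, here's a toy single-head attention sketch in NumPy (not from any real codebase; the dimensions, the `prefill`/`decode_step` names, and the `Wq`/`Wk`/`Wv` projections are all made up for illustration, and the causal mask during prefill is omitted for brevity):

```python
import numpy as np

d = 64                                   # head dimension (arbitrary for this sketch)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # stand-in projections

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(X):
    """Process the whole prompt at once: attention from all N tokens to all
    N tokens (O(N^2)); return the KV cache plus the last token's output."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # each (N, d)
    out = softmax(Q @ K.T / np.sqrt(d)) @ V
    return (K, V), out[-1]               # cache K, V; last row feeds next-token sampling

def decode_step(x, cache):
    """One autoregressive step: a single new query attends to all cached
    tokens, O(N) per step, instead of recomputing the full O(N^2)."""
    K, V = cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # each (d,) -- just the one new token
    K, V = np.vstack([K, k]), np.vstack([V, v])  # append new K, V to the cache
    out = softmax(q @ K.T / np.sqrt(d)) @ V
    return (K, V), out

# prompt of 10 tokens, then 3 decode steps
X = rng.normal(size=(10, d))
cache, y = prefill(X)
for _ in range(3):
    # in a real model you'd sample a token and embed it; here we just
    # feed the previous output vector back in as a stand-in
    cache, y = decode_step(y, cache)
```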
I somehow missed the "decode phase" paragraph, hence the confusion - that's essentially the separation I meant; you're obviously correct.