> interleaving the processing of 200ms worth of input and generation of 200ms worth of output.
How does this work? Don't LLMs/transformers need the whole context to output the next chunk of tokens?
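For reference, one way this could work without future context: a causal decoder only ever attends to the prefix it has already seen, so input tokens and output tokens can share one growing sequence, with the model alternating between ingesting ~200ms of input and emitting ~200ms of output. A toy sketch of that interleaving control flow (all names here are hypothetical stand-ins, not from the linked system):

```python
from typing import Iterator, List

class DummyCausalModel:
    """Stand-in for a causal transformer with a KV cache."""

    def __init__(self) -> None:
        self.cache: List[int] = []  # pretend KV cache: the prefix so far

    def ingest(self, token: int) -> None:
        # Prefill step: attend over cache + new token, extend the cache.
        self.cache.append(token)

    def next_token(self) -> int:
        # Decode step: "sample" conditioned on the whole prefix so far.
        token = sum(self.cache) % 100  # dummy sampling for illustration
        self.cache.append(token)
        return token

def interleaved_stream(model: DummyCausalModel,
                       input_chunks: Iterator[List[int]],
                       out_per_chunk: int = 3) -> Iterator[List[int]]:
    for chunk in input_chunks:        # each chunk ~= 200ms of input tokens
        for tok in chunk:             # ingest input for this time slice
            model.ingest(tok)
        # emit the matching ~200ms of output, token by token
        yield [model.next_token() for _ in range(out_per_chunk)]

if __name__ == "__main__":
    model = DummyCausalModel()
    chunks = iter([[1, 2, 3], [4, 5, 6]])
    for out in interleaved_stream(model, chunks):
        print(out)
```

The key point the sketch illustrates: nothing in causal attention requires future tokens, so "whole context" only ever means "everything up to now", which is exactly what streaming provides.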