The dimensions should actually be closer to 12,000 * (no. of tokens * no. of layers / x), where x depends on architectural features like MLA (Multi-head Latent Attention) and GQA (Grouped-Query Attention), which shrink the per-token cache. There is this thing called the KV cache, which holds an enormous latent state.
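A back-of-the-envelope sketch of where that divisor x comes from. The config numbers below are illustrative assumptions (roughly a Llama-2-7B-shaped model), not values from any specific model card:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token at every layer.

    The leading factor of 2 counts both the K and the V tensor;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Full multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes(n_tokens=4096, n_layers=32, n_kv_heads=32, head_dim=128)

# GQA: query heads share a smaller pool of K/V heads (here 8), cutting the
# cache 4x -- this reduction is one source of the divisor "x" above.
gqa = kv_cache_bytes(n_tokens=4096, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# -> MHA: 2.0 GiB, GQA: 0.5 GiB
```

MLA goes further than GQA by projecting K/V into a small shared latent vector per token, so x grows larger still.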