37 billion bytes per token?
Edit: Oh, assuming this is an estimate based on the model weights moving from HBM to SRAM, that's not how transformers are applied to input tokens. You only have to move the weights for every token during generation, not during "prefill". (And actually during generation you can use speculative decoding to do better than this roofline anyway.)
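A rough back-of-envelope sketch of that roofline (the parameter count, weight precision, bandwidth and prompt length below are my own placeholders, not numbers from the post): at batch size 1, every decode step streams the full weights from HBM, while prefill amortizes a single weight pass over the whole prompt.

    # Naive bandwidth-bound decode roofline. All numbers are illustrative assumptions.
    param_count = 37e9     # hypothetical 37B-parameter model
    bytes_per_w = 1.0      # e.g. 8-bit weights
    hbm_bw      = 3.35e12  # bytes/s, roughly an H100's HBM bandwidth

    weight_bytes = param_count * bytes_per_w   # ~37 GB streamed per decode step
    tokens_per_s = hbm_bw / weight_bytes       # roofline at batch size 1
    print(f"decode roofline: {tokens_per_s:.0f} tok/s at batch 1")

    # Prefill: one weight pass covers the whole prompt, so per-token weight
    # traffic is weight_bytes / prompt_len, which is why prefill is compute-bound.
    prompt_len = 2048
    print(f"prefill weight traffic per token: {weight_bytes / prompt_len / 1e6:.1f} MB")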
> (And actually during generation you can use speculative decoding to do better than this roofline anyway.)
And more importantly, batching: taking the example from the blog post, it would be 32 tokens per forward pass in the decoding phase.
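Sketching that batching effect with the post's batch of 32 (the weight-byte figure is the same placeholder as above): one weight load serves every sequence in the batch, so per-token weight traffic drops by the batch size.

    weight_bytes = 37e9  # bytes moved per forward pass, placeholder as above
    batch_size   = 32    # batch size from the blog post's example
    per_token = weight_bytes / batch_size
    print(f"weight bytes per generated token at batch {batch_size}: {per_token / 1e9:.2f} GB")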
There's also an estimate of how much the KV cache grows with each subsequent token. That would be on the order of a few MB per token. I think that would be the bottleneck.
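For reference, a per-token KV-cache estimate with placeholder shape parameters (roughly a 70B-class model with grouped-query attention; none of these figures are from the post):

    num_layers   = 80
    num_kv_heads = 8     # grouped-query attention
    head_dim     = 128
    bytes_per_el = 2     # fp16/bf16
    # Factor of 2 for storing both K and V at every layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
    print(f"KV cache growth: {kv_bytes_per_token / 1e6:.2f} MB per token per sequence")
    # ~0.3 MB/token with these shapes; with full multi-head attention (64 KV heads)
    # it would be ~2.6 MB/token, which is where an "MBs per token" figure comes from.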