Your calculations make no sense. Why are you loading the model for each token independently? You can process all the input tokens at the same time as long as they can fit in memory.
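A quick back-of-the-envelope sketch of the difference (the ~74 GB of FP16 weights is the post's figure; the ~3 TB/s bandwidth and 2000-token prompt are just numbers I picked):

    # Toy comparison: re-reading weights per input token vs one pass over the prompt.
    weight_bytes = 74e9       # ~74 GB of FP16 weights (figure from the post)
    mem_bw = 3e12             # bytes/s, assumed HBM-class bandwidth
    prompt_tokens = 2000      # assumed prompt length

    # What the post's math implies: the full weights are read once per input token.
    per_token_reads_s = prompt_tokens * weight_bytes / mem_bw   # ~49 s

    # What prefill actually does: all prompt tokens share one forward pass,
    # so the weights stream through the chip once.
    one_pass_s = weight_bytes / mem_bw                           # ~0.025 s

    print(per_token_reads_s, one_pass_s)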

You are doing the calculation as if they were output tokens in a single batch; that would not make sense even in the decode phase.

This. ChatGPT also agrees with you: "74 GB weight read is per pass, not per token." I was checking the math in this blog post with GPT to understand it better, and it seems legit for the given assumptions.

Then the right calculation for the input tokens is to use FLOPs, not bandwidth as they did.
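Sketched out with my own assumed numbers (not the post's: ~37B params behind the 74 GB of FP16 weights, ~1e15 FLOP/s of dense BF16 compute, ~3 TB/s of bandwidth, a 2000-token prompt), the FLOPs term is what actually bounds prefill:

    # Rough prefill estimate via FLOPs vs the bandwidth floor for one weight pass.
    params = 37e9              # assumed param count (74 GB / 2 bytes at FP16)
    peak_flops = 1e15          # FLOP/s, assumed dense BF16 throughput
    mem_bw = 3e12              # bytes/s, assumed memory bandwidth
    prompt_tokens = 2000       # assumed prompt length

    prefill_flops = 2 * params * prompt_tokens    # ~2 FLOPs per param per token
    compute_s = prefill_flops / peak_flops        # ~0.15 s

    weight_bytes = 2 * params                     # one read of the FP16 weights
    bandwidth_s = weight_bytes / mem_bw           # ~0.025 s

    # Prefill time is set by the larger of the two terms; at a few thousand
    # prompt tokens that's the compute term, so FLOPs is the right yardstick.
    print(max(compute_s, bandwidth_s))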