Why are GPUs necessarily higher latency than TPUs? Both require roughly the same arithmetic intensity and use the same memory technology at roughly the same bandwidth.
And our LLMs still have latencies well into the human-perceptible range. If there's any necessary, architectural difference in latency between TPUs and GPUs, I'm fairly sure it would be far below that.
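To put rough numbers on that, here is a back-of-envelope sketch. Everything in it is an assumption for illustration only: the model size, the 2 bytes per parameter, the 3 TB/s aggregate HBM bandwidth, and the simplification that each decode step has to stream every weight from HBM once. None of these are measured figures for any real GPU or TPU.

```python
def decode_step_ms(param_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step, assuming every weight is read from HBM once."""
    return param_bytes / hbm_bytes_per_s * 1e3


# Hypothetical numbers, chosen only to show the order of magnitude.
PARAM_BYTES = 70e9 * 2   # assumed: 70B parameters at 2 bytes each
HBM_BANDWIDTH = 3.0e12   # assumed: 3 TB/s aggregate, same for either chip

step_ms = decode_step_ms(PARAM_BYTES, HBM_BANDWIDTH)
reply_s = 200 * step_ms / 1e3  # a 200-token reply, one step per token

print(f"{step_ms:.1f} ms per token, {reply_s:.1f} s per 200-token reply")
```

Under these assumptions the bandwidth floor alone is tens of milliseconds per token, and it is the same for any chip with the same HBM bandwidth. Any fixed architectural overhead would have to be comparable to that to be noticeable.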
My understanding is that TPUs do not use memory in the same way. GPUs need to do significantly more store/fetch operations to and from HBM, whereas TPUs pipeline far more of their data through systolic arrays. From what I've heard, this generally improves latency and also reduces the overhead of supporting large context windows.