I’m not 100% sure, even in context. Sorry for the big quote:
> Overhead is when your code is spending time doing anything that's not transferring tensors or computing things. For example, time spent in the Python interpreter? Overhead. Time spent in the PyTorch framework? Overhead. Time spent launching CUDA kernels (but not executing them)? Also... overhead.
> The primary reason overhead is such a pernicious problem is that modern GPUs are really fast. An A100 can perform 312 trillion floating point operations per second (312 TeraFLOPS). In comparison, Python is really slooooowwww. Benchmarking locally, Python can perform 32 million additions in one second.
> That means that in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS.
> Even worse, the Python interpreter isn't even the only source of overhead - frameworks like PyTorch also have many layers of dispatch before you get to your actual kernel. If you perform the same experiment with PyTorch, we can only get 280 thousand operations per second. Of course, tiny tensors aren't what PyTorch is built for, but... if you are using tiny tensors (such as in scientific computing), you might find PyTorch incredibly slow compared to C++.
Emphasis mine.
It’s all a bit jumbled up. I get that he was going for an informal tone and that this isn’t exactly a benchmark, but I’m still not sure. Based on the second emphasized part, I think the “bad” measurements are coming from Python + PyTorch with too-small workloads, maybe dispatching to the CPU? The first one, though, looks like naive Python loops.
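For what it's worth, here is a rough sketch of how I'd guess the first number was produced, plus the arithmetic behind the 9.75 million figure. This is my assumption, not the author's actual benchmark: the constants are copied from the quote, while the helper names and the `timeit` setup are mine. The measured rate also includes `timeit`'s own loop overhead, so treat it as an order-of-magnitude number.

```python
import timeit

# Figures taken from the quote above.
A100_FLOPS = 312 * 10**12          # 312 TFLOPS
PYTHON_ADDS_PER_SEC = 32 * 10**6   # 32 million additions per second
PYTORCH_OPS_PER_SEC = 280 * 10**3  # 280 thousand operations per second


def gpu_flops_per_op(ops_per_sec):
    """FLOPs an A100 could do in the time one slow operation takes."""
    return A100_FLOPS / ops_per_sec


def measure_python_adds_per_sec(n=1_000_000):
    """Time n float additions in the interpreter (hypothetical setup)."""
    elapsed = timeit.timeit("x + y", setup="x = 1.0; y = 2.0", number=n)
    return n / elapsed


print(gpu_flops_per_op(PYTHON_ADDS_PER_SEC))  # 9750000.0
print(f"{gpu_flops_per_op(PYTORCH_OPS_PER_SEC):.3g} FLOPs per tiny PyTorch op")
print(f"{measure_python_adds_per_sec():,.0f} additions/s on this machine")
```

If the PyTorch number came from a similar loop over tiny tensors, swapping the timed statement for a small tensor add would be the obvious variant, but I haven't checked that this is what he did.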