Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not do any operations. They perform no FLOPs; they are the thing the FLOPs are performing.

"The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e., not sensible).

> Why are we comparing a programming language and a GPU?

You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.

When the author says it's millions of FLOPs faster on a GPU than in an interpreted programming language, they aren't comparing the two directly, but the algorithms that run on them. The tools used to implement/run the algorithms stand in for the algorithms themselves.

It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of FLOPs slower than on the GPU". There is no category error there.

the sentence is ambiguous because "Python" can mean Python + a certain library or even a different Python implementation

but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup

in this case NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)

another analogy:

I find it useful to be able to compare the fastest known way to move a container from A to B using a certain vehicle (e.g. truck) with how fast a person who cannot drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)

I'm also interested in how much energy is needed, how much the hw costs and so on

Often there are many ways to do things, comparing is a great starting point for learning more

related to the truck analogy: an advantage of the way slower Python approach is: it does not need a GPU

that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
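a minimal sketch of what "changing the representation" could look like with the stdlib `array` module. the variable names and the size `n` are made up for illustration. it only shows the storage change (boxed float objects vs. a contiguous buffer of C doubles); whether that actually gets you more FLOPs depends on the workload, since an element-wise loop still goes through the interpreter:

```python
from array import array

n = 1_000  # arbitrary size, chosen only for illustration

# A list holds references to separately allocated float objects.
as_list = [float(i) for i in range(n)]

# array("d", ...) packs the same values into one contiguous buffer of C doubles.
as_array = array("d", as_list)

# Both support the same built-ins and give the same result.
total_list = sum(as_list)
total_array = sum(as_array)

# Size of the raw numeric payload: itemsize bytes per element.
payload_bytes = len(as_array) * as_array.itemsize
```

the contiguous buffer is also what lets other code (e.g. anything using the buffer protocol) read the numbers without unboxing them one by one.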

> This is a category error.

Okay, but surely you know what they actually mean, right? Or are you being willfully obtuse? They are comparing code run by CPython (the main Python implementation) on the CPU with a kernel running on the GPU.

I’m not 100% sure, in context. Sorry for the big quote:

> Overhead is when your code is spending time doing anything that's not transferring tensors or computing things. For example, time spent in the Python interpreter? Overhead. Time spent in the PyTorch framework? Overhead. Time spent launching CUDA kernels (but not executing them)? Also... overhead.

> The primary reason overhead is such a pernicious problem is that modern GPUs are really fast. An A100 can perform 312 trillion floating point operations per second (312 TeraFLOPS). In comparison, Python is really slooooowwww. Benchmarking locally, Python can perform 32 million additions in one second.

> That means that in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS.

> Even worse, the Python interpreter isn't even the only source of overhead - frameworks like PyTorch also have many layers of dispatch before you get to your actual kernel. If you perform the same experiment with PyTorch, we can only get 280 thousand operations per second. Of course, tiny tensors aren't what PyTorch is built for, but... if you are using tiny tensors (such as in scientific computing), you might find PyTorch incredibly slow compared to C++.

Emphasis mine.

It’s all a bit jumbled up. I get that he was going for an informal tone and that this isn’t exactly a benchmark. But I’m still not sure. Based on the second emphasized part, I think the “bad” measurements are coming from Python+PyTorch with too-small workloads, maybe dispatching to the CPU? The first one, though, looks like naive Python loops.
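For what it's worth, here is a rough sketch of what the "naive Python loop" reading might look like. This is a guess at the method, not the author's actual benchmark: the helper names and the loop body are hypothetical, and the only numbers taken from the article are the 312 TeraFLOPS and 32 million additions per second figures.

```python
import time

A100_FLOPS = 312e12  # peak figure quoted in the article (312 TeraFLOPS)


def additions_per_second(n: int = 1_000_000) -> float:
    """Time n float additions in a plain interpreted loop (hypothetical helper)."""
    x = 0.0
    start = time.perf_counter()
    for _ in range(n):
        x += 1.0  # one floating point addition per iteration
    elapsed = time.perf_counter() - start
    return n / elapsed


def a100_flops_per_python_add(python_adds_per_sec: float) -> float:
    """How many A100 FLOPs fit in the time of one Python addition."""
    return A100_FLOPS / python_adds_per_sec


# Using the article's own numbers: 312e12 / 32e6
print(a100_flops_per_python_add(32e6))  # 9750000.0

# Using whatever this machine measures (the result varies by machine and Python version)
print(a100_flops_per_python_add(additions_per_second()))
```

The first print reproduces the 9.75 million figure from the quote, so at least the arithmetic is consistent with a loop like this. The measured rate also includes loop overhead, so it says more about the interpreter than about the CPU.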