Hacker News

> in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS

wild

Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not do any operations. They perform no FLOPs; they are the thing the FLOPs are performing.

"The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not).

gchamonlive 7 hours ago [ - ]

> Why are we comparing a programming language and a GPU.

You are taking the statement too literally and forgetting it's a figure of speech, specifically metonymy.

When the author says it's millions of FLOPs faster on a GPU than in an interpreted programming language, they are not comparing the two directly but the algorithms that run on them, so the tools used to implement/run the algorithms stand in for the algorithms themselves.

It makes sense if you say "running similar logic -- like multiplying vectors and matrices -- on the CPU is millions of FLOPs slower than on the GPU". There is no category error there.

tosh 7 hours ago [ - ]

the sentence is ambiguous because "Python" can mean python + a certain library and even a different Python implementation

but I find it illuminating to compare what a certain hardware can do in principle (what is possible) vs what I can "reach" as programmer within a certain system/setup

in this case NVIDIA A100 vs "Python" that does not reach an A100 (without the help of CUDA and PyTorch)

another analogy:

I find it useful to be able to compare the fastest known way to move a container from A to B using a certain vehicle (e.g. truck) with how fast a person who cannot drive that truck can do it + variants of it (on foot, using a cargo bike, using a boat via waterway, …)

I'm also interested in how much energy is needed, how much the hw costs and so on

Often there are many ways to do things, comparing is a great starting point for learning more

tosh 7 hours ago [ - ]

related to the truck analogy: an advantage of the way slower Python approach is: it does not need a GPU

that said: Python can get to more FLOPs by changing the representation: https://docs.python.org/3/library/array.html
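To make the "changing the representation" point concrete, here is a minimal sketch comparing a list of boxed floats with an `array('d')` of packed doubles. It assumes CPython (where `sys.getsizeof` reports per-object sizes); the variable names and the element count are made up for illustration.

```python
import sys
from array import array

n = 100_000  # arbitrary element count, chosen for illustration

boxed = [float(i) for i in range(n)]  # list of pointers to separate float objects
dense = array("d", boxed)             # contiguous C doubles, no per-element object

# Bytes used by the container plus (for the list) every boxed float it points to.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(v) for v in boxed)
dense_bytes = sys.getsizeof(dense)

print(f"list of floats: {boxed_bytes:,} bytes")
print(f"array('d'):     {dense_bytes:,} bytes")
print(f"itemsize:       {dense.itemsize} bytes per element")
```

Note that looping over the array in pure Python still boxes each element on access, so the dense layout mostly pays off once the buffer is handed to compiled code.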

smasher164 6 hours ago [ - ]

> This is a category error.

Okay, but surely you know what they actually mean, right? Or are you being willfully obtuse? They are comparing code run by CPython (the main Python implementation) on the CPU with a kernel running on the GPU.

bee_rider 6 hours ago [ - ]

I’m not 100% sure, in context. Sorry for the big quote:

> Overhead is when your code is spending time doing anything that's not transferring tensors or computing things. For example, time spent in the Python interpreter? Overhead. Time spent in the PyTorch framework? Overhead. Time spent launching CUDA kernels (but not executing them)? Also... overhead.

> The primary reason overhead is such a pernicious problem is that modern GPUs are really fast. An A100 can perform 312 trillion floating point operations per second (312 TeraFLOPS). In comparison, Python is really slooooowwww. Benchmarking locally, Python can perform 32 million additions in one second.

> That means that in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS.

> Even worse, the Python interpreter isn't even the only source of overhead - frameworks like PyTorch also have many layers of dispatch before you get to your actual kernel. If you perform the same experiment with PyTorch, we can only get 280 thousand operations per second. Of course, tiny tensors aren't what PyTorch is built for, but... if you are using tiny tensors (such as in scientific computing), you might find PyTorch incredibly slow compared to C++.

Emphasis mine.

It’s all a bit jumbled up. I get that he was going for an informal tone and this isn’t exactly a benchmark. But I’m still not sure. Based on the second emphasized part, I think the “bad” measurements are coming from Python+PyTorch but with too-small workloads, and dispatching to CPU, maybe? But the first one looks like naive Python loops.
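For reference, the 9.75 million figure in the quote is just the ratio of the two quoted rates. A rough sketch of that arithmetic, plus a naive timing loop of the kind that could produce a number like "32 million additions per second" (the function names and the loop itself are assumptions for illustration, not the article's actual benchmark):

```python
import time

A100_FLOPS = 312e12          # 312 TFLOP/s, as quoted from the article
PYTHON_ADDS_PER_SEC = 32e6   # 32 million additions/s, as quoted from the article


def flops_ratio(fast_rate: float, slow_rate: float) -> float:
    """Operations the fast side completes in the time of one slow operation."""
    return fast_rate / slow_rate


def measure_python_adds(n: int = 1_000_000) -> float:
    """Rough additions/second for a naive interpreted loop (hypothetical benchmark)."""
    x = 0.0
    start = time.perf_counter()
    for _ in range(n):
        x += 1.0
    elapsed = time.perf_counter() - start
    return n / elapsed


if __name__ == "__main__":
    print(f"quoted ratio:  {flops_ratio(A100_FLOPS, PYTHON_ADDS_PER_SEC):,.0f}")
    print(f"measured here: {measure_python_adds():,.0f} additions/s")
```

The measured rate will vary by machine and Python version, so it is only useful as an order-of-magnitude check against the quoted 32 million.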

p1esk 9 hours ago [ - ]

This statement makes zero sense

tosh 8 hours ago [ - ]

re comments:

yes of course this is apples to oranges but that's kind of the point

it shows the vast span between specialized hardware throughput IFF you can use an A100 at its limit vs overhead of one of the most popular programming languages in use today that eventually does the "same thing" on a CPU

the interesting thing is why that is so

CPU vs GPU (latency vs throughput), boxing vs dense representation, interpreter overhead, scalar execution, layers upon layers, …

p1esk 7 hours ago [ - ]

A100 FP32 throughput “at its limit”: 19.5 TFLOP/s.

AMD EPYC 9965 FP32 throughput “at its limit”: 41.2 TFLOP/s (192 cores x 64 FP32 FLOP/cycle/core x 3.35GHz).

aesthesia 2 hours ago [ - ]

That's also a CPU that came out four years later than the A100. The contemporaneous B200 is not optimized for FP32 and does 74.45 TFLOP/s. For FP16 it's at ~2 PFLOP/s.

p1esk an hour ago [ - ]

The point is that modern CPUs are not as slow as most DL people think. Roughly 10x slower but with a lot more memory.

zzzoom 3 hours ago [ - ]

EPYC 9965: 614GBps of 12-channel DDR5-6400

A100: 1935GBps of HBM2e

Most of those FLOPS are constrained by memory bandwidth.

Const-me 13 minutes ago [ - ]

> Most of those FLOPS are constrained by memory bandwidth

I believe inference with large enough batch size is almost always compute bound, simply due to algorithmic complexity.

Each step of tiled matrix multiplication with square tiles of size N^2 takes O(N^2) memory loads and O(N^3) compute operations. With N = 32 or 64, you will likely saturate compute even on iGPUs with DDR4 or DDR5 memory pretending to be VRAM.
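The O(N^2) loads versus O(N^3) compute argument can be put into numbers with a simplified model. The sketch below assumes each tile step loads one N x N tile of A and one of B and performs N^3 multiply-adds (2 FLOPs each); it ignores cache reuse, register blocking, and writing back C. The function names are hypothetical, and the A100 figures are the ones quoted elsewhere in this thread.

```python
def tile_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte loaded for one tile step under the simplified model.

    Loads:   two n x n tiles (one from A, one from B) -> 2 * n^2 elements.
    Compute: n^3 multiply-adds                         -> 2 * n^3 FLOPs.
    """
    flops = 2 * n**3
    bytes_loaded = 2 * n * n * bytes_per_elem
    return flops / bytes_loaded


def machine_balance(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs the hardware can execute per byte it can stream from memory."""
    return peak_flops / bandwidth_bytes_per_s


if __name__ == "__main__":
    for n in (32, 64):
        print(f"N={n}: {tile_intensity(n):.1f} FLOP/byte (FP32)")
    # A100 figures as quoted in this thread: 19.5 TFLOP/s FP32, 1935 GB/s HBM2e.
    print(f"A100 balance: {machine_balance(19.5e12, 1935e9):.2f} FLOP/byte")
```

Under this model the intensity grows linearly with N (it simplifies to N divided by the element size), which is why larger tiles push a kernel toward being compute bound.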

tosh 7 hours ago [ - ]

A100: 312 TFLOP/s for FP16

but it is very impressive how far modern CPUs get as well (also in smart phones!)

p1esk 7 hours ago [ - ]

Intel Xeon 6980P: 128 cores x 1024 FP16 FLOP/cycle/core x 3.2 GHz: 419 TFLOP/s
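The peak figures quoted in this thread all come from the same formula: cores x FLOP/cycle/core x clock. A small sketch that reproduces them, using only the numbers given in the comments (the helper name is made up):

```python
def peak_tflops(cores: int, flop_per_cycle_per_core: int, ghz: float) -> float:
    """Theoretical peak in TFLOP/s.

    cores * FLOP/cycle/core * GHz gives GFLOP/s; divide by 1000 for TFLOP/s.
    """
    return cores * flop_per_cycle_per_core * ghz / 1000.0


# Inputs exactly as quoted in the thread.
epyc_9965_fp32 = peak_tflops(cores=192, flop_per_cycle_per_core=64, ghz=3.35)
xeon_6980p_fp16 = peak_tflops(cores=128, flop_per_cycle_per_core=1024, ghz=3.2)

print(f"EPYC 9965 FP32:  {epyc_9965_fp32:.1f} TFLOP/s")
print(f"Xeon 6980P FP16: {xeon_6980p_fp16:.1f} TFLOP/s")
```

These are theoretical ceilings that assume every core issues its maximum FLOPs every cycle at the stated clock; sustained throughput on real workloads will be lower.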

tosh 6 hours ago [ - ]

I'm not saying "GPU more brrt than CPU"

I found the comparison interesting

on Intel Xeon 6980P with 419 TFLOP/s it is still (maybe even more?) interesting to ask:

how much throughput can you reach with Python, Python with lib x, y, z, with C++ like this, with C++ like that etc etc and why?

no?

p1esk 5 hours ago [ - ]

No one in their right mind would use pure Python to do matrix multiplication. It’s like using a screwdriver to hammer nails into wood.

But this discussion is even more bizarre than comparing a screwdriver to a hammer: it’s like comparing a screwdriver to a nail.

itishappy 7 hours ago [ - ]

Which, let's be honest, is probably still being orchestrated by Python somewhere.

Python is 9.75 million times faster than Python.

giancarlostoro 7 hours ago [ - ]

I was researching whether there was much benefit to using Rust or C++ over Python for AI, and it turns out the GPU doesn't care once the instructions are in, because it's an entirely different spec running on the GPU. The only thing you might save on is the "startup" cost of getting your code onto the GPU, I guess? I assume that time cost is minuscule though. Once it's all in memory, nobody cares that you spent any time "booting it up", any more than they care how long Windows takes these days.

BillStrong 6 hours ago [ - ]

As long as you don't keep calling out to the CPU, that is.

Tool calling, searches, cache movement if used, and even debug steps all stall the GPU waiting for the CPU.

There was a test of turning one of the under-1B Qwen3+ models into a single kernel that runs as one GPU pass without stalling on the CPU. It saw quite a bit of perf lift over vLLM, I believe, showing this is still an issue.

It's been a month, so I don't remember more details than this.

hashmap 4 hours ago [ - ]

you can port anything python is doing with a couple prompts into rust/c++, including parity validation. when the barrier to migrating is that thin, you are losing money and time even continuing to talk about it. python is miserably slow, so dont let it touch any part of your system. no snakes in the house.

jmalicki 5 hours ago [ - ]

PyTorch dataloaders are often horribly inefficient; a lot of stuff there can benefit from Rust/C++.

xyzsparetimexyz 9 hours ago [ - ]

Single core vs multi core accounts for much of this

cdavid 9 hours ago [ - ]

Not really. The GPU's many cores, at least for FP32, give you 2 to 4 orders of magnitude compared to a high-speed CPU.

The rest will be from "Python float" (i.e. not from NumPy) to C, which already gives you 2 to 3 orders of magnitude difference, and then another 2 to 3 from plain C to optimized SIMD.

See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.

p1esk 7 hours ago [ - ]

Theoretical FP32 performance of AMD EPYC 9965 is double that of A100: 41.2 TFLOP/s vs 19.5 TFLOP/s
