That's not an empirical fact. If you look at the code for AI frameworks, you'll see it isn't true in practice once you go beyond a single isolated matrix multiplication.

OK, have an amateur implement a hand-written Fourier transform in C and have it beat numpy's implementation. There, now you have two examples, and there are loads more: image and signal processing in general, data frame operations like grouping and joining, big-integer arithmetic and cryptography in general, plain old sorting, etc.
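To make the FFT example concrete, here's a sketch (mine, not the commenter's) of the "amateur baseline": a direct O(n^2) DFT in plain Python next to numpy's FFT. They compute the same thing; the gap is in the algorithm and the optimized implementation behind the library call.

```python
# Illustrative sketch: naive O(n^2) DFT vs numpy's FFT.
import cmath
import numpy as np

def naive_dft(x):
    """Direct DFT from the definition: X_j = sum_k x_k * e^(-2*pi*i*j*k/n)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

x = [1.0, 2.0, 3.0, 4.0]
# Same results, wildly different asymptotics (O(n^2) vs O(n log n)) and constants.
assert np.allclose(naive_dft(x), np.fft.fft(x))
```

Even a correct hand-written C version of the naive loop above loses to the library, because the library is running a better algorithm tuned by experts.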

Amateurs, and even some folks who have worked with these things for a while, won't beat off-the-shelf Python calls into highly optimized, well-constructed libraries.

I think we're talking past each other. I'm not suggesting that most users should write individual computational ops themselves in C; they're unlikely to match the performance of expert-written C that has had real time invested in it. The point I'm trying to make is that when you use a framework for modern AI, you're not just calling an individual op, or many individual ops, in isolation. How multiple dependent ops are sequenced matters, along with the other code around them, e.g. data loading code (which may be custom and application-specific, so not available in a library). My argument is that it may be easier to reach peak performance on your hardware if that framework code were all written in a lower-level language.
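A small sketch of the sequencing point, using numpy (my example, not anything from the thread): the same chain of dependent ops written naively allocates a temporary array and makes a fresh pass over memory at every step, while a carefully sequenced version reuses one buffer. A lower-level language or a fusing compiler can go further still and collapse the whole chain into a single loop.

```python
# Illustrative: identical math, different sequencing of dependent ops.
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000)

def naive_chain(x):
    # Each intermediate (x * 3.0, + 1.0, abs) allocates a temporary array
    # and is a separate pass over all the data.
    return np.sqrt(np.abs(x * 3.0 + 1.0))

def sequenced_chain(x):
    # One allocation up front, then every subsequent op writes in place.
    buf = np.multiply(x, 3.0)
    np.add(buf, 1.0, out=buf)
    np.abs(buf, out=buf)
    np.sqrt(buf, out=buf)
    return buf

assert np.allclose(naive_chain(x), sequenced_chain(x))
```

Each individual ufunc here is already expert-written C; the point is that how you string them together from Python still affects memory traffic and peak performance.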

Ah, I see. For production or most real-time systems you'd be right, imo, but the time taken to complete most tasks in a conventional AI development environment means the overhead from Python moving from lib to lib becomes negligible, no?
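A rough sanity check of the "overhead becomes negligible" claim (my sketch; timings will vary by machine): compare the per-call cost when the op is tiny (so Python dispatch dominates) against a call where the actual math dominates.

```python
# Illustrative: Python call overhead vs real compute, via matmuls of two sizes.
import time
import numpy as np

small = np.ones((4, 4), dtype=np.float32)        # call cost dominates
big = np.ones((2000, 2000), dtype=np.float32)    # compute dominates

def per_call_seconds(a, reps):
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ a
    return (time.perf_counter() - t0) / reps

overhead = per_call_seconds(small, 10_000)
work = per_call_seconds(big, 5)
# For large ops, the Python-level overhead is a tiny fraction of the runtime;
# it only starts to matter when you issue many small dependent ops in a row.
```

Which is exactly why the disagreement above is really about workload shape: long-running training steps amortize the interpreter, tight loops of small ops don't.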