That's the thing with Python: a lot of things should be bound by hardware limits like memory bandwidth or raw compute, but in practice they often end up limited by the Python interpreter if you're not careful.
For example, if you're doing operations on numpy arrays like c = a + b * c, interpreted numpy will be slower than compiled numba or C++, simply because eager evaluation runs b * c and then the addition as two separate passes over memory with a temporary array in between. It will never fuse them into a single loop, let alone into an FMA.
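
To make that concrete, here's a rough sketch of the two evaluation strategies. The function names `eager` and `fused` are made up for illustration, and the numba import is treated as optional (if it's missing, the loop just runs uncompiled so the snippet still executes). Whether the compiled loop really ends up as an FMA depends on the target CPU and the compiler's floating-point settings, so take that part as a possibility, not a guarantee.

```python
import numpy as np

try:
    from numba import njit  # optional; assumed installed for the compiled path
except ImportError:
    def njit(func):  # fallback: run the loop uncompiled so the sketch still executes
        return func


def eager(a, b, c):
    # Roughly what NumPy does for `a + b * c`: two separate passes over
    # memory, with a full-size temporary array in between.
    tmp = np.multiply(b, c)
    return np.add(a, tmp)


@njit
def fused(a, b, c):
    # One pass, no temporary. A compiler may turn the loop body into an FMA
    # instruction if its floating-point settings allow contraction.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i] * c[i]
    return out


rng = np.random.default_rng(0)
a, b, c = (rng.random(1000) for _ in range(3))

# allclose instead of ==, since a fused multiply-add rounds once, not twice.
same = np.allclose(eager(a, b, c), fused(a, b, c))
```

Both versions compute the same thing up to rounding; the difference is only in how many times the data gets streamed through memory and whether the multiply and add can be merged.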