It's not that you need to turn on some extra library backends and rebuild. The abstractions themselves are fundamentally at odds with hitting peak performance for many workloads, so you have to rewrite your code.

Individual image processing operations often have very low arithmetic intensity. If you don't combine them into much larger subroutines (which are necessarily less generic and orthogonal), you spend all your time waiting on memory between every little op.
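
To make that concrete, here's a rough sketch of the kind of rewrite I mean. The three per-pixel ops (`apply_gain`, `apply_offset`, `clamp01`) and both pipeline functions are made-up names for illustration, not any particular library's API. The unfused version does one flop per pixel per pass and streams the whole image through memory three times; the fused version does the same math in a single pass, at the cost of being one special-purpose routine instead of three reusable ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel ops, each roughly one flop per pixel.
inline float apply_gain(float p, float gain)     { return p * gain; }
inline float apply_offset(float p, float offset) { return p + offset; }
inline float clamp01(float p) { return std::min(std::max(p, 0.0f), 1.0f); }

// Unfused: three generic passes. Each pass reads a full image and writes a
// full image, so the data crosses the memory bus three times in each direction.
void unfused(const std::vector<float>& src, std::vector<float>& dst,
             float gain, float offset) {
    assert(dst.size() == src.size());
    std::vector<float> a(src.size()), b(src.size());  // full-size temporaries
    for (std::size_t i = 0; i < src.size(); ++i) a[i] = apply_gain(src[i], gain);
    for (std::size_t i = 0; i < src.size(); ++i) b[i] = apply_offset(a[i], offset);
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = clamp01(b[i]);
}

// Fused: one pass. Each pixel is loaded once, stays in a register for all
// three ops, and is stored once. Same arithmetic, about a third of the traffic.
void fused(const std::vector<float>& src, std::vector<float>& dst,
           float gain, float offset) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = clamp01(apply_offset(apply_gain(src[i], gain), offset));
    }
}
```

Both functions compute the same result; the only difference is how many times the image is read and written. The fused loop is also exactly the "less generic and orthogonal" code a library of small composable ops can't hand you without some kind of fusion layer on top.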

> It's not that you need to turn on some extra library backends and rebuild

Our problem domains must obviously differ. Good luck =3