Also, performance is generally pretty low; I've been on projects where we rewrote OpenCV code into more-or-less obvious hand-rolled code and got a 5x speedup. The abstractions are generally a bit too thick and oriented around single pixels (which also makes the API a bit too verbose for my taste).
Machine vision has always been resource intensive... and if you are doing trained ML projects the hardware choices are actually very limited.
To enable Intel TBB, CUDA, and CPU-specific compiler optimizations... you will almost certainly need to rebuild the library and customize your application build.
Some tasks degrade in performance on a GPU, and others are 740 times faster... ymmv. =3
It's not that you need to turn on some extra library backends and rebuild, it's that the abstractions themselves are fundamentally at odds with hitting peak performance on many things, so you end up having to rewrite your code.
Individual image processing operations are often very low arithmetic intensity. If you don't combine them into much larger subroutines—which are necessarily less generic and orthogonal—you spend all your time waiting on memory between every little op.
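To make the memory-traffic point concrete, here's a rough sketch in plain C++ (no OpenCV). The scale/offset/threshold pipeline and all the function names are made up for illustration; this is not how any particular library implements things. The unfused version runs three generic passes with a full-size intermediate buffer between each, while the fused version does the same arithmetic in a single pass.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical generic op #1: multiply every pixel by a gain.
std::vector<float> scale(const std::vector<float>& in, float gain) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * gain;
    return out;
}

// Hypothetical generic op #2: add a bias to every pixel.
std::vector<float> offset(const std::vector<float>& in, float bias) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] + bias;
    return out;
}

// Hypothetical generic op #3: binary threshold to 0 / 255.
std::vector<std::uint8_t> threshold(const std::vector<float>& in, float cutoff) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] > cutoff ? 255 : 0;
    return out;
}

// Unfused: three passes over memory, two full-size float temporaries.
std::vector<std::uint8_t> unfused(const std::vector<float>& img,
                                  float gain, float bias, float cutoff) {
    return threshold(offset(scale(img, gain), bias), cutoff);
}

// Fused: one pass, intermediates stay in registers.
// Less generic (it only does this one pipeline), but each pixel is
// read once and written once.
std::vector<std::uint8_t> fused(const std::vector<float>& img,
                                float gain, float bias, float cutoff) {
    std::vector<std::uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i) {
        const float scaled = img[i] * gain;
        const float shifted = scaled + bias;
        out[i] = shifted > cutoff ? 255 : 0;
    }
    return out;
}
```

Counting bytes for an image too big to sit in cache: the unfused path moves about 21 bytes per pixel (4 in + 4 out, 4 in + 4 out, 4 in + 1 out) for two flops and a compare, the fused path about 5. Same arithmetic, roughly a quarter of the traffic, and that's before the allocations. One caveat: a compiler may contract the fused multiply and add into an FMA, so the two versions can disagree on pixels sitting right at the cutoff.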
> It's not that you need to turn on some extra library backends and rebuild
Our problem domains must obviously differ. Good luck =3