NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?