> You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.
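
For anyone unfamiliar with the shape being described, here is a rough hand-written sketch of it in plain C. The names `add_arrays_sketch` and `SKETCH_WIDTH` are made up for illustration, and the inner lane loop only stands in for what would be a single SIMD instruction in real compiler output.

```c
#include <stddef.h>

/* Hypothetical stand-in for the number of lanes in the chosen vector size. */
enum { SKETCH_WIDTH = 4 };

void add_arrays_sketch(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Preamble: take the "vector" path only if at least one full chunk exists. */
    if (n >= SKETCH_WIDTH) {
        size_t vec_end = n - (n % SKETCH_WIDTH);
        for (; i < vec_end; i += SKETCH_WIDTH) {
            /* A vectorizing compiler would emit one SIMD add for this chunk. */
            for (size_t lane = 0; lane < SKETCH_WIDTH; ++lane)
                dst[i + lane] = a[i + lane] + b[i + lane];
        }
    }

    /* Postamble: scalar loop for the leftover n % SKETCH_WIDTH elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The checks here depend only on `n`, so they cost a compare and a branch per call, not per element.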

Why would checking for supported SIMD instructions cause that much additional work?

Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.
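
A minimal sketch of what I mean, assuming GCC or Clang on x86 (where `__builtin_cpu_supports` is available). The names `sum_ints`, `sum_impl`, `sum_resolve`, and `sum_wide` are hypothetical, and `sum_wide` is plain C standing in for a real AVX2 kernel. The pointer swap is not synchronized, so treat this as single-threaded illustration only.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *, size_t);

/* Baseline implementation. */
static long sum_scalar(const int *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

/* Stand-in for a hand-written AVX2 version (plain C here, same result). */
static long sum_wide(const int *v, size_t n)
{
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += v[i]; s1 += v[i + 1]; }
    if (i < n) s0 += v[i];
    return s0 + s1;
}

/* The one place the feature check happens. */
static int cpu_has_avx2(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0; /* assumption: fall back to scalar on other targets */
#endif
}

static long sum_resolve(const int *v, size_t n);

/* Starts out pointing at the resolver; overwritten on the first call. */
static sum_fn sum_impl = sum_resolve;

static long sum_resolve(const int *v, size_t n)
{
    sum_impl = cpu_has_avx2() ? sum_wide : sum_scalar;
    return sum_impl(v, n);
}

/* Public entry point: after the first call this is one indirect call. */
long sum_ints(const int *v, size_t n)
{
    return sum_impl(v, n);
}
```

After the first call, `sum_ints` never runs the feature check again; what remains is an indirect call through `sum_impl`.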

> Why would checking for supported SIMD instructions cause that much additional work?

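Because CPUID on x86 is expensive: it is a serializing instruction, so it drains the pipeline every time it executes, and inside a VM it typically traps to the hypervisor.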

> That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.

Compilers can't elide those checks unless they are given specific flags telling them the target CPU supports that instruction set, or unless they always target the minimum SIMD instruction set supported on the target CPU. They often emit suboptimal code for all sorts of reasons, and this is one of them.

> Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.

Yes, but like I said, you have to do it very carefully: call CPUID once, outside any hot loop, to initialize the decision, then rely on the CPU's branch predictor to hide the cost of the boolean or switch statement that does the actual dispatch.
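
Roughly this pattern, as a sketch assuming GCC or Clang on x86. `dispatch_init`, `g_use_avx2`, and `scale_block` are hypothetical names, and `scale_block_wide` is plain C standing in for a real AVX2 kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Cached decision; written once by dispatch_init(), only read afterwards. */
static bool g_use_avx2 = false;

/* Call once at startup, outside any hot loop. This is the only CPUID cost. */
void dispatch_init(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    g_use_avx2 = __builtin_cpu_supports("avx2") != 0;
#else
    g_use_avx2 = false; /* assumption: scalar fallback on other targets */
#endif
}

static void scale_block_scalar(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) v[i] *= k;
}

/* Stand-in for an AVX2 kernel (plain C here, same result). */
static void scale_block_wide(float *v, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i) v[i] *= k;
}

/* Hot path: the branch always goes the same way after init,
   so it is highly predictable. */
void scale_block(float *v, size_t n, float k)
{
    if (g_use_avx2)
        scale_block_wide(v, n, k);
    else
        scale_block_scalar(v, n, k);
}
```

The thing to avoid is putting the feature query itself inside `scale_block`; a load and a well-predicted branch are cheap, a CPUID per call is not.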