For something this short that is pure math, why not just hand-write assembly for the most popular platforms? That would stop a future compiler from de-optimizing it.

Keep this algorithm as the fallback for all other platforms.
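A minimal sketch of that "asm where it matters, portable fallback elsewhere" pattern. The routine under discussion isn't shown here, so `mul_hi64` (the high 64 bits of a 64×64-bit unsigned product) is a hypothetical stand-in. The sketch assumes GCC or Clang on x86-64 as the "popular platform" and uses GCC extended inline asm there. Every other target gets plain C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for "this algorithm": high 64 bits of a * b.
// Portable fallback: schoolbook multiplication on 32-bit halves.
constexpr std::uint64_t mul_hi64_portable(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    const std::uint64_t lo_lo = a_lo * b_lo;
    const std::uint64_t hi_lo = a_hi * b_lo;
    const std::uint64_t lo_hi = a_lo * b_hi;
    const std::uint64_t hi_hi = a_hi * b_hi;

    // Middle column plus the carry out of the low word; cannot overflow 64 bits.
    const std::uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
// Assumed "popular platform": x86-64 with GCC/Clang extended asm.
// MUL r/m64 computes RDX:RAX = RAX * operand; we keep only RDX.
inline std::uint64_t mul_hi64(std::uint64_t a, std::uint64_t b) {
    std::uint64_t lo, hi;
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) : "a"(a), "rm"(b) : "cc");
    (void)lo;  // low half is unused
    return hi;
}
#else
// Every other platform: fall back to the portable version.
inline std::uint64_t mul_hi64(std::uint64_t a, std::uint64_t b) {
    return mul_hi64_portable(a, b);
}
#endif

// The fallback doubles as a reference implementation for testing the asm path.
inline void check_mul_hi64(std::uint64_t a, std::uint64_t b) {
    assert(mul_hi64(a, b) == mul_hi64_portable(a, b));
}
```

The trade-off that the replies point at: the asm only covers the targets you write it for, so the portable version has to exist anyway. Keeping it also gives you something to check the asm against.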

This is pretty much assembly written as C++; there's not much the compiler can ruin.

Because that isn’t portable?