Another challenge may be as a result of using a portable SIMD API instead of specific ISA instructions. I'm specifically thinking about computing the mask, which on x86-64 is I assume implemented via movemask. But aarch64 lacks such an instruction, so you need to do other shenanigans for the best codegen: https://github.com/BurntSushi/memchr/blob/ceef3c921b5685847e...

A language level or std library level poly-fill for the missing SIMD operations for the platforms would be great.