Excellent write-up. I really appreciate the clear and detailed explanation of the algorithm.
The nice thing about Zig's SIMD support is that register sizes are handled transparently. You can just declare a 64-byte vector as the Block type, and Zig will use one AVX-512 register or two AVX2 (32-byte) registers behind the scenes. All other SIMD operations on the type are likewise lowered to whatever registers the target platform provides at compile time.
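For example, something like this (my sketch, not the article's code; assumes Zig 0.11+ @splat syntax, and containsByte is a made-up helper):

    const std = @import("std");

    // 64-byte block; the compiler maps it onto whatever SIMD registers the
    // target offers (one AVX-512 register, two AVX2 registers, etc.).
    const Block = @Vector(64, u8);

    // Hypothetical helper: does `needle` occur anywhere in `block`?
    fn containsByte(block: Block, needle: u8) bool {
        const splat: Block = @splat(needle);
        return @reduce(.Or, block == splat);
    }

    pub fn main() void {
        var buf = [_]u8{'a'} ** 64;
        buf[17] = 'z';
        const block: Block = buf; // arrays coerce to same-size vectors
        std.debug.print("found: {}\n", .{containsByte(block, 'z')});
    }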
Even using two AVX2 registers for 64 bytes of data is a win thanks to instruction-level parallelism. Most CPUs have multiple SIMD execution units, and since the two 32-byte halves are independent, the CPU can process them in parallel.
The next optimization is to align the data on a 64-byte boundary to match the L1/L2 cache-line size, so each block load touches a single cache line. Memory access is slow in general.
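In Zig that can look like this (a sketch; 64 bytes is the usual cache-line size on x86-64 and aarch64, and the alignedAlloc call is an assumption to check against your Zig version):

    // Static buffer aligned to a 64-byte boundary, so every 64-byte Block
    // load starts exactly on a cache line.
    var haystack: [4096]u8 align(64) = undefined;

    // For heap data, the allocator can be asked for the same alignment:
    //   const buf = try allocator.alignedAlloc(u8, 64, len);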
Another challenge may come from using a portable SIMD API instead of ISA-specific instructions. I'm specifically thinking about computing the mask, which on x86-64 is, I assume, implemented via movemask. But aarch64 lacks such an instruction, so you need other shenanigans for the best codegen: https://github.com/BurntSushi/memchr/blob/ceef3c921b5685847e...
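In Zig, the portable version of that mask computation usually looks something like this (my sketch, not code from the article; matchMask is a made-up helper):

    const Block = @Vector(64, u8);

    // Portable "movemask": compare all lanes, then bitcast the vector of
    // bools into an integer with one bit per lane. x86-64 lowers this to a
    // pmovmskb-style instruction; aarch64 has no direct equivalent, so the
    // backend emits a longer narrow/shift/extract sequence, which is where
    // hand-tuned per-ISA code can still win.
    fn matchMask(block: Block, needle: u8) u64 {
        const splat: Block = @splat(needle);
        return @bitCast(block == splat);
    }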
A language-level or standard-library polyfill for the missing SIMD operations on those platforms would be great.