Acceleration by using the x86 AVX-512 extensions is especially compelling. Since ARM64 processors are becoming pervasive in server-side systems, is-there/will-there-be any optimization using the ARM64 NEON vector instructions in current or future Go versions? (The NEON instructions are 128-bit, instead of 512 bits in the AVX-512 set, but may still be useful.)