I like that more people are getting involved with SIMD, and there have been several posts lately on both memmem-like and memcpy-like operations implemented in SIMD in different programming languages.
In most cases, though, these still focus on AVX/NEON instructions from over 10 years ago, rather than newer and more powerful AVX-512 variations, SVE & SVE2, or RVV.
These newer ISAs can noticeably change how one would implement a state-of-the-art substring search or copy/move operation. In my projects, such as StringZilla, I often use mask K registers (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...) and an input-dependent mix of temporal and non-temporal loads and stores (https://github.com/ashvardanian/StringZilla/blob/2f4b1386ca2...).
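To make the "input-dependent mix of temporal and non-temporal stores" idea concrete, here is a minimal sketch, not StringZilla's actual code. It assumes plain SSE2 instead of AVX-512, so it runs on any x86-64 machine. The names `copy_bytes` and `copy_streaming_sse2` and the 1 MiB `NON_TEMPORAL_THRESHOLD` are hypothetical; a real cutoff would be tuned to the cache sizes of the target CPU.

```rust
/// Hypothetical cutoff: inputs at least this large bypass the cache on store.
pub const NON_TEMPORAL_THRESHOLD: usize = 1 << 20;

/// Copies `src` into `dst`, picking the store flavor from the input size.
pub fn copy_bytes(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len(), "slices must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if src.len() >= NON_TEMPORAL_THRESHOLD {
            // SSE2 is part of the x86-64 baseline, so no runtime check is needed here.
            unsafe { copy_streaming_sse2(dst, src) };
            return;
        }
    }
    // Small inputs (and non-x86 targets) use ordinary, cache-friendly stores.
    dst.copy_from_slice(src);
}

#[cfg(target_arch = "x86_64")]
unsafe fn copy_streaming_sse2(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::*;
    let len = src.len();
    // Streaming stores need a 16-byte-aligned destination: copy the head normally.
    let head = dst.as_ptr().align_offset(16).min(len);
    dst[..head].copy_from_slice(&src[..head]);
    // Whole 16-byte blocks go through non-temporal stores.
    let body = (len - head) / 16 * 16;
    let mut i = head;
    while i < head + body {
        let block = _mm_loadu_si128(src.as_ptr().add(i) as *const __m128i);
        _mm_stream_si128(dst.as_mut_ptr().add(i) as *mut __m128i, block);
        i += 16;
    }
    // Make the streamed data visible before any later ordinary stores.
    _mm_sfence();
    // Fewer than 16 bytes remain: copy the tail normally.
    dst[i..].copy_from_slice(&src[i..]);
}
```

With AVX-512, the unaligned head and tail could be handled by masked loads and stores instead of the scalar copies used here.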
In typical cases, the difference between the suggested SIMD kernels and the state-of-the-art can be as significant as 50% in throughput. As SIMD becomes more widespread, it would be beneficial to focus more on delivering software and bundling binaries, rather than just the kernels.
Sure, but I have to support a range of target CPUs in the consumer desktop market, and the older CPUs are the ones that need optimizations the most. That means NEON on ARM64 and AVX2 or SSE2-4 on x64. Time spent on higher vector instruction sets benefits a smaller fraction of the user base that already has better performance, and that's especially problematic if the algorithm has to be reworked to take best advantage of the higher extensions.
AVX-512 is also in bad shape market-wise, despite its amazing feature set and how long it's been since initial release. The Steam Hardware Survey, which skews toward the higher end of the market, only shows 18% of the user base having AVX-512 support. And even that is despite Intel's best efforts to reverse progress by shipping all new consumer CPUs with AVX-512 support disabled.
I’m not as familiar with the NEON side, but AVX-512 support is pretty variable on new processors. Alder Lake omits it entirely. So we’re still in a world where AVX2 is the lowest common denominator for a system library that wants wide support.
Even that is too high a requirement if your target user runs low-end hardware. Most Intel chips launched between 2017 and 2021 under the Pentium Silver/Gold and Celeron brands lack AVX (the first one, let alone AVX2).
Not so much in AWS, though I’m unsure of other cloud providers. For desktop systems, sure.
How strange! I was about to add a comment that I would probably stick to SSE2 or something like that to be sure my code suits as large an audience as possible, including CPUs from more than 10 years ago, ARM, etc.
Case in point: I was very disappointed recently when I wanted to try Ghostty on my laptop and the binary compiled for Debian failed to run due to an invalid instruction. I don't want to force the same experience on others.
This is sort of a category error. I don't know what ghostty is doing (or perhaps its distributor), but lots of widely used software (including my own ripgrep, but also even things like glibc) will query what the CPU supports at runtime and dispatch to the correct routine based on that. So things like, "I'm only going to stick to SSE2 for maximal compatibility" have a false assumption baked into them.
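As a rough illustration of the runtime-dispatch pattern described above, here is a minimal sketch. It is not how ripgrep or glibc implement it. The byte-counting kernel and the names `count_byte`, `count_byte_avx2`, and `count_byte_scalar` are hypothetical; only `is_x86_feature_detected!` and the `std::arch` intrinsics are real standard-library APIs.

```rust
/// Portable fallback that works on every target.
fn count_byte_scalar(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// AVX2 kernel: compiled with AVX2 enabled, but only called after a runtime check.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn count_byte_avx2(haystack: &[u8], needle: u8) -> usize {
    use std::arch::x86_64::*;
    let needle_vec = _mm256_set1_epi8(needle as i8);
    let mut count = 0usize;
    let mut chunks = haystack.chunks_exact(32);
    for chunk in &mut chunks {
        // Compare 32 bytes at once; each matching byte sets one bit in the mask.
        let block = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let eq = _mm256_cmpeq_epi8(block, needle_vec);
        count += (_mm256_movemask_epi8(eq) as u32).count_ones() as usize;
    }
    // The last (len % 32) bytes go through the scalar path.
    count + count_byte_scalar(chunks.remainder(), needle)
}

/// Public entry point: queries the CPU at runtime and dispatches accordingly.
pub fn count_byte(haystack: &[u8], needle: u8) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // Safe to call: the CPU has just reported AVX2 support.
            return unsafe { count_byte_avx2(haystack, needle) };
        }
    }
    count_byte_scalar(haystack, needle)
}
```

The same binary then runs on an SSE2-only machine and on an AVX2 machine. A production library would typically cache the detection result (for example in a function pointer) rather than re-checking on every call.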
PS: Finding CPUs that support AVX-512 and SVE is relatively trivial - practically every cloud has them by now. It's harder to find Arm CPUs with wide physical registers, but that's another story.
But no one likes to develop on the cloud. The latency and storage synchronization can be very off-putting.
Because it is very hard to find new hardware to test it on, let alone expect your users to take advantage of it on their machines.
AVX-512 is such a mess that Intel just removed it after a generation or two. And on the Arm SVE side it is even worse. There is already SVE2, but good luck finding even an SVE-enabled machine.
Apple does not support it on their Apple Silicon™ (only SME), Snapdragon does not support it even on their latest 8 Elite. 8 Elite Gen 2 is supposed to come with it.
Only MediaTek and Neoverse chips support it. So finding a machine to develop and test such code on can be a little difficult.