Good write-up. This is a very popular approach to substring search! It is still worst case `O(m*n)` though. Do you have a fallback implementation like the `memchr` crate has to guarantee `O(m+n)`?

I'll also push back on some bits at the end:

    > But if it’s so much better, then why haven’t I made a pull request to
    > change std.mem.indexOf to use SIMD? Well, the reason is that
    >
    > std.mem.indexOf is generic over element size, and having a size
    > larger than u8 makes the algorithm much slower
    >
    > The algorithm used in std.mem.indexOf is cross-platform, while the
    > SIMD code wouldn’t be. (not all platforms have SIMD registers at all,
    > Arm has only 128-bit)
Does Zig not have a way to specialize this for sequences of unsigned 8-bit integers? If not, and you're therefore forced to use a more generic algorithm, that seems pretty unfortunate.

    > Substring searching is rarely the bottleneck in programs,
    > especially ones written in a fast language like Zig. That’s why
    > I don’t personally think it would be worth it to add it to the
    > standard library.
Oh I'm not sure I buy this at all! Substring search is a primitive operation and can easily be a bottleneck. There's a reason why widely used substring search implementations tend to be highly optimized.

I'm the author of the post; thanks for your feedback! I was inspired by your comment on HN a while back and started learning about this stuff; reading the source code of `memchr` was especially great.

You're totally right about the first part: if there were a serious attempt to add this to Zig's standard library, there would definitely need to be a fallback to avoid the `O(m*n)` situation.

I'll admit that there are a lot of false assumptions at the end: you could totally specialize it for u8, and also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`.
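
To make that concrete, here's roughly the shape I have in mind (just a sketch, not the post's code; I'm assuming a Zig version around 0.11 where the function is still called `std.simd.suggestVectorSize`, since newer releases renamed it to `suggestVectorLength`):

  const std = @import("std");

  // Pick the block size for u8 from the target's CPU features at compile time.
  // suggestVectorSize returns null when the target has no usable SIMD, so fall
  // back to a fixed width in that case.
  const block_len = std.simd.suggestVectorSize(u8) orelse 16;
  const Block = @Vector(block_len, u8);

  fn anyByteEql(block: Block, needle: u8) bool {
    const mask: Block = @splat(needle); // broadcast the needle byte to every lane
    return @reduce(.Or, block == mask); // true if any lane matched
  }

Since `block_len` comes from the compile-time target, building for a baseline CPU versus the native one gives different widths, which is exactly the portability caveat raised below.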

Or at runtime, if you'd like. You can create a generic binary that runs faster on supported platforms.

> Or at runtime, if you'd like

You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

> also get the block size according to CPU features at compile time with `std.simd.suggestVectorSize()`

You have to be careful with this, since I believe `std.simd.suggestVectorSize` returns values for the minimum SIMD version you're targeting, which can be suboptimal for portable binaries.

You probably want a mix where you carefully compute the vector size for the current platform once, globally, and have multiple compiled dispatch paths in your binary that you pick based on that value, letting the CPU's branch predictor hide the cost of a check before each invocation.
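
Sketching the shape I mean (only the dispatch structure is the point here; `detectSimdLevel` is a made-up placeholder for whatever one-time feature probe you use, not a real std API):

  const std = @import("std");

  const SimdLevel = enum { scalar, avx2 };

  // Decided once, early, outside any hot loop; read on every search call.
  var simd_level: SimdLevel = .scalar;

  fn detectSimdLevel() SimdLevel {
    // Hypothetical stand-in for the one-time CPUID / feature probe.
    return .avx2;
  }

  pub fn init() void {
    simd_level = detectSimdLevel();
  }

  // Placeholder bodies so the sketch compiles; imagine one real implementation
  // compiled per vector width.
  fn indexOfAvx2(haystack: []const u8, needle: []const u8) ?usize {
    return std.mem.indexOf(u8, haystack, needle);
  }
  fn indexOfScalar(haystack: []const u8, needle: []const u8) ?usize {
    return std.mem.indexOf(u8, haystack, needle);
  }

  pub fn indexOf(haystack: []const u8, needle: []const u8) ?usize {
    // This switch goes the same way on every call, so the branch predictor
    // hides its cost relative to the search itself.
    return switch (simd_level) {
      .avx2 => indexOfAvx2(haystack, needle),
      .scalar => indexOfScalar(haystack, needle),
    };
  }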

> You have to be careful about how you do it because those runtime checks can easily swamp the performance gains you get from SIMD.

That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.

Why would checking for supported SIMD instructions cause that much additional work?

Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.
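
I.e. something like a lazily-resolved function pointer: the first call does the check and rebinds the pointer, and every later call jumps straight to the chosen body (again just a sketch; the feature check is a made-up placeholder):

  const std = @import("std");

  // Starts out pointing at the resolver; after the first call it points at the
  // chosen implementation and the check never runs again.
  var indexOfImpl: *const fn ([]const u8, []const u8) ?usize = &resolve;

  fn resolve(haystack: []const u8, needle: []const u8) ?usize {
    indexOfImpl = if (cpuHasAvx2()) &indexOfAvx2 else &indexOfGeneric;
    return indexOfImpl(haystack, needle);
  }

  fn cpuHasAvx2() bool {
    return false; // hypothetical stand-in for a real CPUID probe
  }

  // Placeholder bodies so the sketch compiles.
  fn indexOfAvx2(h: []const u8, n: []const u8) ?usize {
    return std.mem.indexOf(u8, h, n);
  }
  fn indexOfGeneric(h: []const u8, n: []const u8) ?usize {
    return std.mem.indexOf(u8, h, n);
  }

  pub fn indexOf(haystack: []const u8, needle: []const u8) ?usize {
    return indexOfImpl(haystack, needle);
  }

In a multi-threaded program you'd want the pointer swap to be atomic (or done during startup), but the idea is the same.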

> Why would checking for supported SIMD instructions cause that much additional work?

Because CPUID checks on x86 are expensive; among other things, CPUID is a serializing instruction.

> That seems surprising, particularly given that autovectorizing compilers tend to insert pretty extensive preambles that check for whether or not it's likely the vectorized one will have a speedup over the looping version (e.g., based on the number of iterations) and postambles that handle the cases where the number of loop iterations isn't cleanly divisible by the number of elements in the chosen vector size.

Compilers can't elide those checks unless they're given specific flags telling them the target CPU supports that instruction set, or unless they just target the minimum SIMD instruction set for the architecture. They often emit suboptimal code for all sorts of reasons, this being one of them.

> Also, even if this is the case, you can always check once and then replace the function body with the chosen one, eliding the check.

Yes, but like I said, you have to do it very carefully: call CPUID once, outside of any hot loop, to initialize your decision-making, and then rely on the CPU's branch predictor to hide the cost of the boolean / switch statement that does the actual dispatch.

It is very easy to specialise a function in Zig: you just put `if (T == u8)` or something like that inside the function and do whatever you want in there.

Can the compiler detect that and use the proper code so no test is needed at runtime?

This is Zig so I guess the answer is "yeah, duh" but wanted to ask since it sounded like the solution is less, uh, "compiler-friendly" than I would expect.

Yes, and if you're paranoid you can write

  if (comptime T == u8) {
    // code
  }
to guarantee that if you're wrong about how the compiler behaves then you'll get a compiler error.

Zig has no type info at runtime, so this is (and always will be) guaranteed to be a comptime check.

Hence the paranoia :-D

But I do think adding explicit `comptime` in places like this is reasonable for the sake of conveying intent to other programmers.
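
For completeness, in context the pattern looks something like this (a hypothetical generic function to show where the specialization slots in, not the actual std.mem.indexOf):

  const std = @import("std");

  fn indexOfElem(comptime T: type, haystack: []const T, needle: T) ?usize {
    if (comptime T == u8) {
      // Dedicated byte path: this is where the @Vector / SIMD version would
      // live. Placeholder: delegate to the scalar helper.
      return std.mem.indexOfScalar(u8, haystack, needle);
    }
    // Generic path for every other element type.
    for (haystack, 0..) |elem, i| {
      if (elem == needle) return i;
    }
    return null;
  }

  test "each instantiation only analyzes its own branch" {
    try std.testing.expectEqual(@as(?usize, 2), indexOfElem(u8, "abc", 'c'));
    try std.testing.expectEqual(@as(?usize, 1), indexOfElem(u32, &[_]u32{ 7, 9 }, 9));
  }

Because the condition is comptime-known, the u8 branch is never even analyzed for the `u32` instantiation; the explicit `comptime` just makes that guarantee loud.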