>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

>

>for (int i = 0; i < 1000; i++) {

> small_numbers[smlen] = numbers[i];

> smlen += (numbers[i] < 500);

>}

Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?

Nah. (numbers[i] < 500) is an expression which evaluates to true (1) or false (0). Evaluating this doesn't require a branch. There are instructions on modern CPUs to turn this expression into a number without a conditional jump. (cmp (compare), setle (set if comparison was less than or equal), then add).

> then wouldn't they also optimize the naive piece of code?

Great question. They do sometimes!

In general, the problem for compilers is that its not obvious which method would be better in some given piece of code. Most branches are highly predictable. Like, imagine a for loop which counts to 1000. At the end of the loop body, the code branches to see whether we should stay in the loop, or exit the loop. The first 999 times through the loop we keep going - so 99.9% of the time, the branch ends up taking the same path. Its very predictable! CPU designers optimise heavily for this, via branch prediction logic. Highly predictable branches run fast. (This is also why array bounds checking doesn't really hurt performance at all.)

But the branch predictor really struggles when the condition is unpredictable - ie, when a conditional branch is taken about 50% of the time. As is the case in a sorting algorithm.

The compiler has no idea whether any condition in your code is predictable or not. There are hints you can use, but it often defaults to just doing whatever you ask it to do.

Here's what the compiler actually does with the code you quoted. You can see the extra branch + jump for the second version of the code:

https://c.godbolt.org/z/zv7Tcd49f

I clicked around - for some reason even using __builtin_expect_with_probability, none of the compilers I tried will convert from one version of this code into the other.

If you hoist small_numbers[smlen] = numbers[i] out of the if statement then the compiler will produce nearly the same branchless assembly for both cases (the only difference being compare against 499 followed by setle vs. compare against 500 followed by setl).

Taking a second look I want to say you need to hoist the assignment for the loops to be semantically identical anyways.

[deleted]

At the bottom of the page there's a link, "When ‘if’ slows you down, avoid it" [1], that discusses these exact questions. It's basically what @josephg said, but it also shows the assembly language for each version.

[1] https://tiki.li/blog/branchless

There's no branch in that code either way. The comparison operator outputs a value (which is arithmetic, not a branch), and that value is added unconditionally.

Isn’t there an implicit check to exit the loop?

The check isn't important; what's important is being predictable so the CPU can guess which way the check will go. I don't know exactly how it works, but after the first couple of loops, the predictor will assume it's always going to end up in the loop and make that the fast path. It may guess wrong the first couple of loops, and the last check wrong, but the other 997 will be correct.

There is a static branch predictor that is used if there is no statistic on a branching instruction yet, and it's really simple: Jumps backward are assumed to be taken (they usually are from a loop), jumps forward are assumed to be not taken.

So the jump that forms the loop will be predicted correctly for all executions but the very last (when the loop ends).

That's very cute.

I wonder how much more complicated and effective statistical predictors are.

They get much more complicated, but their effectiveness tops out where certain branches just can’t be predicted in advance.

“that code” refers to the body of the loop.

Unless the loop is unrolled, yes, there is a branch to exit the loop. But then that doesn’t matter because the whole goal at the beginning was to avoid branch misprediction (which is not the same thing as avoiding branches entirely).