for (int i = 0; i < 1000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch: for (int i = 0; i < 1000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?
Here is another perspective:
- the first one (branchless) use the condition to SAVE the correct value (< 500): it temporarily writes any current value to the same index i, always overwriting the previous value, effectively saving it (by moving forward to i+1) only when the value is right (small number). Downside of this simple function: the last value may be bigger than 500
- the second one use the condition to ADD the value, when it is 100% sure it is a correct small number
They don't. After running, for the values in small_numbers from 0 to smlen-1 they are equivalent.
But if the last value of numbers[] is not smaller than 500, small_numbers[smlen] will contain that value for the first version whereas the second version does not write to small_numbers[smlen].
> "these two loops compute the same value"
At what sequence point? The branchless version writes to small_numbers[smlen], for any given value of smlen, potentially more than once; so there are observable points of time during the loop where the behavior is different. But after the loop, both contain the final write to small_numbers[i] for all 0 <= I < smlen; and the transient writes both don't change observed external behavior, and are apparently cheaper than fewer but conditional writes.
I think the small_numbers array would differ after the end of the loop if, for instance, numbers contained only numbers >= 500. Am I wrong?
smlen would be 0 for both if there are no small numbers, so end result of both is an empty array.
For the first version small_numbers[0] will contain an arbitrary value at the end, and for the second version it happens to contain the last number read, but that address is outside of the 0-length array being returned.
The point is that you should look only at the first smlen entries, which would be 0 for this case.
Writing to array[n] and not incrementing n means that the value just written is outside the "useful" range (from 0 to n-1) and will not be considered (it will be overwritten the next iteration).
I am rather thinking, if one is so much faster, and they are truly equal, why is the compiler too stupid to convert one into the other?
It doesn't convert bogosort into heapsort either, despite the second being much faster than the first. I'm guessing that it's not that easy going from one to the other because the only thing they have in common is the output (and only after you have checked the last value), so if the transformation is not hard-coded into the compiler, the odds of it randomly discovering the optimization is close to zero
Yeah, I would expect such transformations to be implemented as optimizations. Just like maybe (the admitedly simpler):
A syntactical transformation, where it is possible as an equivalent transformation.I may be overlooking special cases, but I thought the compiler is smart enough to infer that the array elements are integers and that `<` will result in a boolean, which is just `0` and `1` and will understand that having only the `if` without `else` branch is equivalent in this case. Guess I was wrong and the compiler is not sophisticated in this specific way.
The two code snippets do different things, apples and oranges... e.g. the array modification in the second example needs to move in front of the if for the two snippets to behave identically. I bet then the compiler output is the same with -O1 or higher.
PS: e.g. note how bla() (first code snippet) and blob() (fixed second code snippet) have identical output (both are turned into the same 'branchless' code via a conditional 'setl' instruction), but the blub() function (original second code snippet) differs because that function has different behaviour:
https://www.godbolt.org/z/h9Kfbn5bc
TL;DR: most 'branchless advice' that only tinkers with language features (like "x = a ? b : c" instead of an if) is useless because to the optimizer passes both are the same thing (a condition).
When there's a difference in the generated code then it's usually a bug and the before-after code are not actually equivalent (like in the code examples above).
It only increments if the number was less than 500, effectively just saving the ones less than 500.
First version has a side effect of writing to small_numbers[0] always.
The compiler probability can't optimize that in the second version.
If it wrote unconditionally and incremented only in the if then I'd guess they would compile to the same thing.
numbers[i] < 500
is a conditional (true or false) that evaluates to 1 or 0 (in C)
Therefore smlen has either a 0 or a 1 added to it's value .. equivilent to only adding 1 if True.