Only in micro-benchmarks.
For real usage, today's CPUs are limited by memory bandwidth.
What are you talking about? In a hot loop in my software renderer, this is like 10x faster:
    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.
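Note that the two versions are also not equivalent: `>> 8` divides by 256, not 255, so a fully opaque white pixel comes out as 254 instead of 255, since (255 * 255) >> 8 = 254. If that matters, here is a minimal sketch of the usual shift-and-add trick for a rounded divide by 255. It assumes the sum stays within 0..255*255, and the function names are made up for illustration, not taken from the renderer:

```c
#include <stdint.h>

/* Truncating shift as in the snippet: divides by 256, not 255. */
static inline uint32_t div256_shift(uint32_t x) { return x >> 8; }

/* Rounded x / 255 without a divide; valid for 0 <= x <= 255 * 255. */
static inline uint32_t div255_round(uint32_t x)
{
    uint32_t t = x + 128u;
    return (t + (t >> 8)) >> 8;
}

/* Counts inputs where div255_round disagrees with a rounded integer divide. */
int count_div255_mismatches(void)
{
    int mismatches = 0;
    for (uint32_t x = 0; x <= 255u * 255u; x++) {
        if (div255_round(x) != (x + 127u) / 255u)
            mismatches++;
    }
    return mismatches;
}
```

It costs two shifts and two adds per channel, so it should still be in the same ballpark as the plain shift, though that is worth measuring in the actual loop.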
Because you are working in the cache.
Also, you should use SIMD.
> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.
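For reference, this is the kind of loop shape compilers can usually auto-vectorize at -O2 or -O3: a flat loop over contiguous arrays with `restrict` pointers and no branches. It is only a sketch. The planar one-channel layout, the `blend_channel` name, and `inv = 255 - alpha` are assumptions, not the renderer's actual code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: blend one 8-bit channel of n pixels in place.
 * Uses the same >> 8 approximation as the snippet in the thread. */
void blend_channel(uint8_t *restrict dst, const uint8_t *restrict src,
                   const uint8_t *restrict alpha, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = alpha[i];
        uint32_t inv = 255u - a;   /* assumed definition of inv_alpha */
        /* Sum is at most 255 * 255, so the shifted result fits in 8 bits. */
        dst[i] = (uint8_t)((src[i] * a + dst[i] * inv) >> 8);
    }
}
```

Whether the compiler actually vectorizes it depends on flags and target; with clang, `-Rpass=loop-vectorize` reports which loops were vectorized.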