Just be careful not to blindly apply the same techniques to a mobile- or desktop-class CPU or above.

A lot of code can be pessimized by golfing instruction counts, hurting instruction-level parallelism and microcode optimizations by introducing false data dependencies.

Compilers outperform humans here almost all the time.

Compilers massively outperform humans if the human has to write the entire program in assembly. Even if a human could write a sizable program in assembly, it would be subpar compared to what a compiler would write. This is true.

However, that doesn't mean that looking at the generated asm, or even writing some, is useless! Just because you can't globally outperform the compiler doesn't mean you can't do it locally! If you know where the bottleneck is, and make those few functions great, that's a force multiplier for you and your program.

It’s absolutely not useless, I do it often as a way to diagnose various kinds of problems. But it’s extremely rare that a handwritten version actually performs better.

yo, completely off topic, but do you work on a voxel game/engine?

yes and you already know me lol, we have been chatting on discord :P

> Compilers outperform humans here almost all the time.

I'm going to be annoying and nerd-snipe you here. It's, generally, really easy to beat the compiler.

https://scallywag.software/vim/blog/simd-perlin-noise-i

"A lot of code can be pessimized by golfing instruction counts"

Can you explain what this phrase means?

An old approach to micro-optimization is to look at the generated assembly and try to achieve the same thing with fewer instructions. However, modern CPUs execute multiple instructions in parallel (out-of-order execution), and this mechanism relies on detecting data dependencies between instructions.

It means that a shorter sequence of instructions is not necessarily faster, and can in fact make the CPU stall unnecessarily.

The fastest sequence of instructions is the one that makes the best use of the CPU’s resources.

I’ve done this: I had a hot loop and discovered that I could reduce the instruction count by adding a branch inside the loop. Definitely slower, which I expected, but it’s worth measuring.

It is not about outperforming the compiler - it’s about being comfortable with measuring where your clock cycles are spent, and for that you first need to be comfortable with timing at the clock-cycle scale. You’re not expected to rewrite the program in assembly. But you should have a general idea, given an instruction, of what its execution entails and where the data is actually coming from. A read over a different bus means a different timing.

Compilers make mistakes too, and they can output outright incorrect code. But that’s a different topic.

Excellent corrective summary.

"Compilers can do all these great transformations, but they can also be incredibly dumb"

-Mike Acton, CppCon 2014