There are 163 lines of C. Of them, with -O3, 104 still show up in the assembly output, so the compiler is able to eliminate an additional ~36% of the lines outright. It doesn't do anything fancy, like autovectorization.

I profiled just now:

    |     | instrs (aarch64) | time, 100k steps (s) | conway samples (%) |
    | -O0 |              606 |                19.10 |             78.50% |
    | -O3 |              135 |                 3.45 |             90.52% |

The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe the earlier run landed on an E core rather than a P core. For -O0, the compiler is emitting machine code like:

    0000000100002d6c ldr x8, [sp, #0x4a0]
    0000000100002d70 ldr x9, [sp, #0x488]
    0000000100002d74 orn x8, x8, x9
    0000000100002d78 str x8, [sp, #0x470]
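
For context, that sequence is what a single bitwise statement turns into when every named value lives on the stack; something like this hypothetical fragment (not the actual source line):

    #include <stdint.h>

    /* Hypothetical: at -O0 each temporary is kept on the stack, so one
       bitwise statement becomes two loads, the ALU op, and a store,
       i.e. the ldr/ldr/orn/str sequence above. */
    uint64_t gate_or_not(uint64_t a, uint64_t b) {
        uint64_t t = a | ~b;   /* a single ORN instruction at -O1 and up */
        return t;
    }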
This load-everything, store-everything pattern is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

    0000000100003744 orr x3, x3, x10
    0000000100003748 orn x1, x1, x9
    000000010000374c and x1, x3, x1
    0000000100003750 orr x3, x8, x17
Looking more closely, there's actually surprisingly little register spilling.

I think the real question you're asking is, as I wrote:

> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?

Part of this is due to my counting the instructions in the disassembly wrong. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculation with the new number, I get 2.11 instructions per bit and 0.553 million instructions per step; dividing that into 3.70 Gcycles/s gives 6,690 fps. That's better than 2,590 fps, but still 3.6× slower than the measured 24,400. I think a factor of 3.6× is something you can chalk up to instruction-level parallelism.
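
Spelling that arithmetic out as a quick sketch (it assumes the 135 instructions advance one 64-bit word of cells and a 512×512 grid, which is what the numbers above work out to):

    #include <stdio.h>

    int main(void) {
        /* Assumptions: 135 instructions per 64-bit word of cells,
           a 512x512 grid, and the 3.70 GHz clock quoted above. */
        double instrs_per_word = 135.0;
        double cells_per_word  = 64.0;
        double grid_cells      = 512.0 * 512.0;
        double clock_hz        = 3.70e9;

        double instrs_per_bit  = instrs_per_word / cells_per_word;  /* ~2.11  */
        double instrs_per_step = instrs_per_bit * grid_cells;       /* ~0.553M */
        double fps_at_one_ipc  = clock_hz / instrs_per_step;        /* ~6,690  */

        printf("%.2f instrs/bit, %.3fM instrs/step, %.0f fps at 1 instr/cycle\n",
               instrs_per_bit, instrs_per_step / 1e6, fps_at_one_ipc);
        printf("measured 24,400 fps implies ~%.1fx from instruction-level parallelism\n",
               24400.0 / fps_at_one_ipc);
        return 0;
    }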

Hope that answers your questions. Love your writing Gwern.

Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.

A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.

Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets', which can then be turned into hyper-optimized symbolic code that achieves the same task-performance but is possibly far more energy-efficient or formally verifiable.

Something like this should be hitting the instruction-level vectoriser (the basic-block-at-a-time one) nearly bang on. It's a lot of the same arithmetic op interleaved. It might be a good test case for LLVM; I would have expected almost entirely vector instructions from this.
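
A toy example of the kind of straight-line pattern that basic-block vectoriser goes after (illustrative only, not the post's actual code):

    #include <stdint.h>

    /* Four independent copies of the same bitwise op on adjacent data:
       straight-line code the basic-block (SLP) vectoriser can fuse into
       128-bit NEON instructions. */
    void or_not_x4(const uint64_t *restrict a, const uint64_t *restrict b,
                   uint64_t *restrict out) {
        out[0] = a[0] | ~b[0];
        out[1] = a[1] | ~b[1];
        out[2] = a[2] | ~b[2];
        out[3] = a[3] | ~b[3];
    }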

z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly (sketched below) might yield a better speedup.
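
For what the explicit version could look like, here's a minimal NEON sketch of one gate layer: the same orn/and pattern as the scalar disassembly, just two 64-bit words per instruction. The function name, argument layout, and the particular gate expression are made up for illustration, not taken from my actual code:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: computes out = (a | ~b) & c over n_words 64-bit
       words, 128 bits per iteration.  Assumes n_words is even to keep
       the sketch short. */
    static void gate_layer_neon(const uint64_t *a, const uint64_t *b,
                                const uint64_t *c, uint64_t *out,
                                size_t n_words) {
        for (size_t i = 0; i < n_words; i += 2) {
            uint64x2_t va = vld1q_u64(a + i);
            uint64x2_t vb = vld1q_u64(b + i);
            uint64x2_t vc = vld1q_u64(c + i);
            uint64x2_t t  = vornq_u64(va, vb);  /* a | ~b (ORN) */
            uint64x2_t r  = vandq_u64(t, vc);   /* (a | ~b) & c */
            vst1q_u64(out + i, r);
        }
    }

Since NEON registers are 128 bits, this caps out at roughly doubling the work per instruction over the scalar 64-bit ops, so it would be a modest win rather than a silver bullet.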

Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research; in a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about the possibility of doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.