There are 163 lines of C. Of those, with -O3, 104 are still represented in the assembly output, so the compiler eliminates an additional ~36.2% of them. It doesn't do anything fancy, like autovectorization.
I profiled just now:
| opt level | instrs (aarch64) | time 100k (s) | conway samples (%) |
|-----------|------------------|---------------|---------------------|
| -O0       | 606              | 19.10         | 78.50%              |
| -O3       | 135              | 3.45          | 90.52%              |
The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe I had a P core vs an E core. For -O0, the compiler is emitting machine code like:

```
0000000100002d6c ldr x8, [sp, #0x4a0]
0000000100002d70 ldr x9, [sp, #0x488]
0000000100002d74 orn x8, x8, x9
0000000100002d78 str x8, [sp, #0x470]
```
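For context, each gate in the C is just a bitwise op on packed 64-bit words; a representative gate might look like this (variable names are illustrative, not the real ones). At -O0 every temporary gets its own stack slot, hence the load/load/compute/store round trip per gate:

```c
#include <stdint.h>

/* Illustrative only: at -O0, 'a', 'b', and 't' each live in a stack
   slot, so this single ORN gate compiles to ldr, ldr, orn, str. */
uint64_t orn_gate(uint64_t a, uint64_t b) {
    uint64_t t = a | ~b;
    return t;
}
```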
That -O0 output is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

```
0000000100003744 orr x3, x3, x10
0000000100003748 orn x1, x1, x9
000000010000374c and x1, x3, x1
0000000100003750 orr x3, x8, x17
```
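In other words, each bitwise operator in the C maps roughly one-to-one onto an instruction, with everything held in registers. A hypothetical expression with the same orr/orn/and shape as the snippet above:

```c
#include <stdint.h>

/* Illustrative only: at -O3 this compiles to roughly
   orr, orn, and, with no stack traffic. */
uint64_t fused_gates(uint64_t p, uint64_t q, uint64_t r, uint64_t s) {
    return (p | q) & (r | ~s);
}
```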
Looking more closely, there's actually surprisingly little register spilling. I think the real question you're asking is, as I wrote:
> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?
Part of this is due to counting the instructions in the disassembly wrong. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit, or 0.553 million instructions per step; dividing that into 3.70 Gcycles/s gives 6,690 fps. That's better than 2,590 fps, but still 3.6× slower than the measured 24,400. I think a factor of 3.6× is something you can chalk up to instruction-level parallelism.
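Spelled out as a quick sketch (the 512×512 grid is an assumption here; it's the size that reproduces the 0.553M figure):

```c
#include <stdio.h>

int main(void) {
    double instrs        = 135.0;          /* -O3 instruction count */
    double bits_per_word = 64.0;           /* cells packed per uint64_t */
    double cells         = 512.0 * 512.0;  /* assumed grid size */
    double hz            = 3.70e9;         /* cycles per second */

    double instrs_per_bit  = instrs / bits_per_word;  /* ~2.11 */
    double instrs_per_step = instrs_per_bit * cells;  /* ~0.553M */
    double fps             = hz / instrs_per_step;    /* ~6,690 */

    printf("%.2f instrs/bit, %.3fM instrs/step, %.0f fps\n",
           instrs_per_bit, instrs_per_step / 1e6, fps);
    return 0;
}
```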
Hope that answers your questions. Love your writing, Gwern.
Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic-gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or maybe the compiler is just bailing out of its analysis before it can fix it all up.
A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.
Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets', which could then be turned into hyper-optimized symbolic code achieving the same task performance while possibly being far more energy-efficient or formally verifiable.
Something like this should be hitting the instruction-level vectoriser, the basic-block-at-a-time one, nearly bang on. It's a lot of the same arithmetic op interleaved. It might be a good test case for LLVM - I would have expected almost entirely vector instructions from this.
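For a sense of what that would look like, here's a hand-vectorized sketch using NEON intrinsics (the function name, buffers, and gate expression are all illustrative, not taken from the post):

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: process two 64-bit words per iteration. The gate
   expression (a | b) & (c | ~d) is a stand-in for the real kernel. */
void step_row_neon(uint64_t *out, const uint64_t *a, const uint64_t *b,
                   const uint64_t *c, const uint64_t *d, size_t n) {
    for (size_t i = 0; i + 2 <= n; i += 2) {
        uint64x2_t va = vld1q_u64(a + i);
        uint64x2_t vb = vld1q_u64(b + i);
        uint64x2_t vc = vld1q_u64(c + i);
        uint64x2_t vd = vld1q_u64(d + i);
        /* orr, orn, and -- the same gates, two words at a time */
        uint64x2_t t = vandq_u64(vorrq_u64(va, vb), vornq_u64(vc, vd));
        vst1q_u64(out + i, t);
    }
}
```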
Z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly might yield a better speedup.
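As a minimal sketch of the equivalence-checking side (shown via Z3's C API here; the Python bindings are terser, and actually proving 42 gates optimal would need a full exact-synthesis encoding, which is a bigger job):

```c
#include <stdio.h>
#include <z3.h>   /* link with -lz3 */

/* Sketch: ask Z3 whether two formulations of the same gate are
   equivalent over 64-bit words, by checking that any counterexample
   is unsatisfiable. The formulas here are illustrative. */
int main(void) {
    Z3_config cfg = Z3_mk_config();
    Z3_context ctx = Z3_mk_context(cfg);
    Z3_del_config(cfg);

    Z3_sort bv64 = Z3_mk_bv_sort(ctx, 64);
    Z3_ast a = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "a"), bv64);
    Z3_ast b = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "b"), bv64);

    /* a | ~b  vs  ~(~a & b): equal by De Morgan */
    Z3_ast lhs = Z3_mk_bvor(ctx, a, Z3_mk_bvnot(ctx, b));
    Z3_ast rhs = Z3_mk_bvnot(ctx, Z3_mk_bvand(ctx, Z3_mk_bvnot(ctx, a), b));

    Z3_solver s = Z3_mk_solver(ctx);
    Z3_solver_inc_ref(ctx, s);
    Z3_solver_assert(ctx, s, Z3_mk_not(ctx, Z3_mk_eq(ctx, lhs, rhs)));
    puts(Z3_solver_check(ctx, s) == Z3_L_FALSE ? "equivalent" : "differ");

    Z3_solver_dec_ref(ctx, s);
    Z3_del_context(ctx);
    return 0;
}
```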
Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research. In a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about the possibility of doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.