Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.
A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept for certain people, but I expect that might be an entire project in its own right for you, not something you could quickly check.
Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to extract learned 'lottery tickets', which could then be turned into hyper-optimized symbolic code achieving the same task performance while being far more energy-efficient or even formally verifiable.
Something like this should be hitting the instruction-level vectoriser (the basic-block-at-a-time one) nearly bang on: it's a lot of the same arithmetic op interleaved. It might be a good test case for LLVM - I would have expected almost entirely vector instructions from this.
z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly might yield a better speedup.
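For concreteness, here's a minimal sketch of the kind of query z3 can answer (assuming the `z3-solver` pip package): exact synthesis of 2-input XOR from NAND gates, asking whether k gates suffice and bumping k until the solver says sat. The same encoding, extended to the real gate library and input width, is how one would probe whether 42 is tight, though the search blows up quickly with gate count.

```python
from z3 import BitVec, BitVecVal, Int, If, Solver, sat

# Truth tables over the four input combinations (x, y) = 00, 01, 10, 11,
# packed into 4-bit bitvectors (bit i = output on combination i).
X = BitVecVal(0b1100, 4)       # x itself
Y = BitVecVal(0b1010, 4)       # y itself
TARGET = BitVecVal(0b0110, 4)  # x XOR y, the function to synthesize

def nand_circuit_exists(k):
    """Ask z3 whether k NAND gates (inputs drawn from x, y, and
    earlier gate outputs) can realize TARGET."""
    s = Solver()
    signals = [X, Y]
    for i in range(k):
        a, b = Int(f"a{i}"), Int(f"b{i}")
        # each gate input is one of the signals defined so far
        s.add(a >= 0, a < len(signals), b >= 0, b < len(signals))

        def mux(idx, options=tuple(signals)):
            # select options[idx] via a chain of If terms
            e = options[0]
            for j in range(1, len(options)):
                e = If(idx == j, options[j], e)
            return e

        out = BitVec(f"g{i}", 4)
        s.add(out == ~(mux(a) & mux(b)))  # NAND, pointwise on truth tables
        signals.append(out)
    s.add(signals[-1] == TARGET)  # the last gate must compute the target
    return s.check() == sat

k = 1
while not nand_circuit_exists(k):
    k += 1
print(f"minimum NAND gates for 2-input XOR: {k}")  # prints 4
```

For XOR this confirms the classic minimum of 4 NAND gates. Proving the full 42-gate circuit minimal is a much bigger search; in practice you'd likely settle for z3 ruling out smaller circuits up to whatever gate budget you can wait out.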
Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research. In a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.