z3 has good python bindings, which I've messed around with before. My manual solution uses 42 gates, I would be interested to see how close to being optimal it is. I didn't ask the compiler to vectorize anything, doing that explicitly might yield a better speedup.
Re:neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research; in a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about the possibility of doing student-teacher training to replace stable subnets of standard e.g. dense relu networks with optimized DLGNs during training, to free up parameters for other things.