Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny relu network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.

The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.

But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.

what percentage of ops were passthru?

ps. superb writeup and project

Thank you! Good question. Here are the NN stats, before lowering to C:

    total gates        | 2303 | 100.0%
    -------------------+------+-------
    passthrough        | 2134 |  92.7%
    gates w/ no effect | 1476 |  64.1%

Note the rows aren't mutually exclusive.