> each parameter exists in two forms simultaneously during training: a full-precision 32-bit floating-point value (p) used for gradient updates, and its binarized counterpart (pb) used for forward computations

So this is only for inference. Also activations aren't quantized, I think?
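For reference, the two-form scheme in the quoted passage is essentially the BinaryConnect-style recipe. Here's a minimal numpy sketch of the idea (a toy illustration, not the paper's code): the fp32 master copy `p` receives the gradient updates, and only its sign `pb` ever appears in the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# p: fp32 master weights; these accumulate the small gradient updates.
p = rng.normal(scale=0.1, size=(4, 3)).astype(np.float32)

def binarize(w):
    # pb: the sign of the master copy, the only thing the forward pass sees.
    return np.where(w >= 0, 1.0, -1.0).astype(np.float32)

x = rng.normal(size=(8, 4)).astype(np.float32)   # toy batch
y = rng.normal(size=(8, 3)).astype(np.float32)   # toy targets
lr = 0.01

for step in range(100):
    pb = binarize(p)                       # forward pass uses only +/-1 weights
    out = x @ pb
    grad_out = 2 * (out - y) / len(x)      # d(MSE)/d(out)
    # Straight-through estimator: treat d(pb)/d(p) as 1, so the gradient
    # computed against pb is applied directly to the fp32 master copy p.
    p -= lr * (x.T @ grad_out)
```

Only `pb` is needed at inference, which is where the one-bit-per-weight saving comes from; during training you still carry the fp32 copy, so the memory win really only shows up at inference.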

Yes, that's been the downside of these forever.

If you use quantized differentiation you can get away with using integers for gradient updates. Explaining how takes a paper, and in the end it doesn't even work very well.
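For a flavour of what "integers for gradient updates" can mean, one simplified fixed-point variant looks like the sketch below (a toy example, not the specific method alluded to above; the backward pass here is still computed in float for brevity, whereas a real scheme quantizes that as well). The weights live as integers with a shared scale, the gradient is rounded onto an integer grid, and the learning rate is folded into a bit shift so the update itself is pure integer arithmetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed-point weights: int32 storage, one shared fp32 scale.
w_scale = np.float32(1 / 1024)              # one weight unit is roughly 0.001
w_int = rng.integers(-100, 100, size=(4, 3)).astype(np.int32)

g_scale = np.float32(1 / 256)               # gradient quantization step
# lr = 1/16; with these scales, lr * g_scale / w_scale = 1/4, i.e. a shift by 2.
shift = 2

x = rng.normal(size=(8, 4)).astype(np.float32)
y = rng.normal(size=(8, 3)).astype(np.float32)

for step in range(200):
    w = w_int * w_scale                              # dequantize for the forward pass
    grad = x.T @ (2 * ((x @ w) - y) / len(x))        # fp gradient of an MSE loss
    g_int = np.round(grad / g_scale).astype(np.int32)   # gradient on an integer grid
    # The update is integer-only; the arithmetic shift plays the role of the
    # learning rate (a real implementation would round rather than floor).
    w_int -= g_int >> shift
```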

At university, way back at the end of the last AI winter, I ended up using genetic algorithms to train the models. It was very interesting because the weights were trained along with the hyperparameters. It was nowhere near practical, though, because gradient descent is so much better at getting real-world results in reasonable time frames - surprisingly, because it's more memory-efficient.
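Probably not that exact setup, but for anyone who hasn't seen it, "weights evolved together with hyperparameters" usually means something like the self-adaptive scheme below, where each individual's genome carries its own mutation scale alongside the weights, and both get mutated and selected on.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(32, 4)).astype(np.float32)   # toy regression data
y = rng.normal(size=(32, 1)).astype(np.float32)

POP, ELITE = 50, 10

def new_individual():
    # Genome = weights plus an evolved hyperparameter (its own mutation scale).
    return {"w": rng.normal(scale=0.5, size=(4, 1)), "sigma": 0.1}

def fitness(ind):
    return -np.mean((x @ ind["w"] - y) ** 2)      # lower loss = higher fitness

pop = [new_individual() for _ in range(POP)]
for gen in range(100):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:ELITE]
    children = []
    for _ in range(POP - ELITE):
        parent = elite[rng.integers(ELITE)]
        # sigma evolves alongside w: mutate the hyperparameter, then use it
        # to mutate the weights.
        sigma = parent["sigma"] * np.exp(0.2 * rng.normal())
        w = parent["w"] + sigma * rng.normal(size=parent["w"].shape)
        children.append({"w": w, "sigma": sigma})
    pop = elite + children
```

It only ever needs forward passes, but you pay in population size times generations, which is where the practicality problem shows up.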

You don't necessarily have to store the parameters in fp32 for gradient updates; I experimented with this and got it working (all-parameter full fine-tuning) with parameters as low as 3-bit (slightly more than 3-bit in effect, because the block-wise scales were higher precision), which is essentially as low as you can go before "normal" training starts breaking down.
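I don't know the exact scheme used there, but "about 3-bit parameters plus higher-precision block-wise scales" typically looks something like this: weights are stored as small integer codes with one scale per block, dequantized for the forward/backward pass, and requantized right after the update (generic sketch, not the actual code).

```python
import numpy as np

BLOCK = 64
LEVELS = 3     # symmetric 3-bit grid: integer codes in [-3, 3]

def quantize_blockwise(w):
    # Each block of 64 weights stores 3-bit codes plus one fp16 scale.
    # (Codes are held in int8 here for clarity rather than bit-packed.)
    w = w.reshape(-1, BLOCK)
    scales = (np.abs(w).max(axis=1, keepdims=True) / LEVELS + 1e-12).astype(np.float16)
    codes = np.clip(np.round(w / scales.astype(np.float32)), -LEVELS, LEVELS).astype(np.int8)
    return codes, scales

def dequantize_blockwise(codes, scales):
    return (codes.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(4)
w = rng.normal(scale=0.02, size=4096).astype(np.float32)
codes, scales = quantize_blockwise(w)

# One training step under this storage: dequantize, apply the update, and
# requantize immediately, so the parameters never sit around in fp32.
grad = rng.normal(scale=0.001, size=4096).astype(np.float32)   # stand-in gradient
w_hat = dequantize_blockwise(codes, scales) - 1e-2 * grad
codes, scales = quantize_blockwise(w_hat)
```

The obvious catch is that any update smaller than a block's quantization step can get rounded away, which is presumably part of why things start breaking down somewhere around 3 bits.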

> Also activations aren't quantized, I think?

The very last conclusion: "Future work will focus on the implementation of binary normalization layers using single-bit arrays operations, as well as on quantizing layer activations to 8 or 16-bit precision. These improvements are expected to further enhance the efficiency and performance of the binary neural network models."
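The activation half of that is the more routine part: 8-bit activation quantization is usually just a per-tensor (or per-channel) scale plus an int8 payload, along these lines (generic sketch, not code from the paper):

```python
import numpy as np

def quantize_activations_int8(a):
    # Symmetric per-tensor quantization: one fp32 scale, int8 payload.
    scale = np.float32(np.max(np.abs(a)) / 127 + 1e-12)
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

a = np.random.default_rng(3).normal(size=(8, 16)).astype(np.float32)
q, s = quantize_activations_int8(a)
print(np.max(np.abs(dequantize(q, s) - a)))   # reconstruction error is at most ~s/2
```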

Yeah, but it's 'quantization aware' during training too, which presumably is what allows the quantization at inference to work.

I wonder if one could store only the binary representation during training and sample a floating-point representation (for both weights and gradients) during backprop.

Backpropagation on random data that is then thrown away would be pretty useless.