In the paper summary they did not call it a bug explicitly, but they do say there is a 32x improvement from using single bits instead.
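
Back-of-the-envelope for where a 32x figure would come from, assuming the baseline is 32-bit fp32 weights and the quantized model stores one sign bit per weight (a numpy sketch, not anything from the paper):

    import numpy as np

    # 4096 x 4096 layer stored as fp32: 4 bytes per weight.
    w = np.random.randn(4096, 4096).astype(np.float32)

    # Pack only the sign of each weight: 1 bit per weight.
    packed = np.packbits(w >= 0, axis=-1)

    print(w.nbytes / packed.nbytes)  # 32.0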

That 32x is an obvious exaggeration. The competition is already using smaller weights, some of which are floating point and some of which aren't.

And they use full-size floats for training.

That means their paper is actually worse than SOTA, which is concerned with training natively in fp4 without full-precision [0] weights for QAT (see the sketch after the footnote).

[0] "full precision" in ML usually means 16 bit floats like bfloat16

I wouldn't say "worse". It's focusing on inference cost and leaving training at a default for now.

For memory, sure. At the cost of 32x slower speed.
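
A rough illustration of where a speed cost like that could come in, assuming the weights are bit-packed as in the earlier sketch: before a standard matmul can run, the bits have to be unpacked back into a compute dtype (numpy again, purely illustrative):

    import numpy as np

    x = np.random.randn(1, 4096).astype(np.float32)
    w = np.random.randn(4096, 4096).astype(np.float32)
    packed = np.packbits(w >= 0, axis=-1)            # 1 bit per weight

    # Unpack bits -> map {0, 1} to {-1.0, +1.0} -> ordinary dense matmul.
    signs = np.unpackbits(packed, axis=-1, count=4096).astype(np.float32) * 2 - 1
    y = x @ signs.T
    print(y.shape)  # (1, 4096)

Whether that works out to 32x in practice depends entirely on the kernel.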