That's an obvious exaggeration. The competition is using smaller weights already, some of which are floating point and some of which aren't.
And they use full-size floats for training.
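To make that concrete, here is a rough sketch (PyTorch; the grids and helper name are just for illustration) of the two 4-bit flavours: evenly spaced int4 levels versus the non-uniform E2M1 fp4 grid, with the full-size float weights only snapped to 4 bits on the way out:

    import torch

    # Two 4-bit weight flavours: integer (int4) and floating point (fp4, E2M1).
    INT4_GRID = torch.arange(-8, 8, dtype=torch.float32)   # 16 evenly spaced levels
    FP4_E2M1_GRID = torch.tensor(                           # 15 non-uniform levels
        [-6., -4., -3., -2., -1.5, -1., -0.5, 0., 0.5, 1., 1.5, 2., 3., 4., 6.])

    def quantize_to_grid(w, grid, scale):
        # Round each scaled weight to the nearest representable 4-bit value.
        idx = (w / scale).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)
        return grid[idx] * scale

    w = torch.randn(4, 4)                                   # full-size float weights
    print(quantize_to_grid(w, INT4_GRID, w.abs().max() / 7))
    print(quantize_to_grid(w, FP4_E2M1_GRID, w.abs().max() / 6))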
That means their paper is actually worse than SOTA, which is concerned with training natively in fp4 without full-precision [0] weights for QAT.
[0] "full precision" in ML usually means 16-bit floats like bfloat16
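For what it's worth, a minimal sketch of that QAT setup (PyTorch, straight-through estimator; the helper name is made up): the optimizer only ever sees full-precision master weights, and just the forward pass sees 4-bit values. That full-precision copy is exactly what native fp4 training tries to drop.

    import torch

    def fake_quant_int4(w):
        # Symmetric per-tensor quantization to the int4 range [-8, 7].
        scale = w.abs().max() / 7
        w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
        # Straight-through estimator: forward uses quantized weights,
        # backward passes gradients through to the full-precision copy.
        return w + (w_q - w).detach()

    master_w = torch.randn(128, 128, requires_grad=True)  # bf16/fp32 master weights
    x = torch.randn(32, 128)
    y = x @ fake_quant_int4(master_w).T
    y.sum().backward()   # gradients land on master_w in full precision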
I wouldn't say "worse". It's focusing on inference cost and leaving training at a default for now.