Your understanding is correct. The key detail is that the author tested on both an M1 Max and an H100, which differ in which formats they support in hardware.

M1 Max: FP16 hardware support; FP8 and bfloat16 are emulated in software (via dequantization)

H100: FP16 and FP8 hardware support

> which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU
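
To make "emulated in software (via dequantization)" concrete, here is a minimal sketch of what that step amounts to: the narrow format is only used for storage, and each value is widened to a type the hardware can compute with before any arithmetic happens. This is not the author's code. The function names are hypothetical, and the FP8 layout is assumed to be E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities), since the post does not say which FP8 variant was used.

```python
import math
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a bfloat16 bit pattern to float32.

    bfloat16 is the upper 16 bits of an IEEE float32, so dequantizing
    is just a 16-bit left shift followed by a reinterpretation.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp8_e4m3_bits_to_float(bits: int) -> float:
    """Decode one FP8 byte, assuming the E4M3 layout (an assumption)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        # In E4M3 the all-ones pattern is NaN; there are no infinities.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of -6.
        return sign * (man / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1, exponent bias of 7.
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)
```

On hardware with native support this decode step is not needed, which is why the same format can behave so differently on the two machines.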