Your understanding is correct. The key detail is that the author ran their tests on both an M1 Max and an H100.
M1 Max: FP16 hardware support; FP8 and bfloat16 are emulated in software (via dequantization)
H100: FP16 and FP8 hardware support
> which I ran both on a MacBook Pro M1 Max and a rented H100 SXM GPU
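
To make "emulated in software (via dequantization)" concrete, here is a rough sketch of what that step looks like: each FP8 byte is decoded, multiplied by a scale, and rounded to FP16 before the hardware FP16 units do the actual math. This assumes the OCP E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7) and a single per-tensor scale; the function names are made up for illustration, and the article's actual kernels may use a different FP8 variant or scaling scheme.

```python
import struct

def fp8_e4m3_to_float(byte: int) -> float:
    """Decode one FP8 byte, assuming the OCP E4M3 layout (bias 7, no infinities)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        # The only NaN encoding in E4M3; there are no infinities.
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def dequantize_to_fp16(raw: bytes, scale: float) -> list[float]:
    """Hypothetical helper: decode FP8 weights, apply a per-tensor scale, round to FP16."""
    out = []
    for b in raw:
        value = fp8_e4m3_to_float(b) * scale
        # Round-trip through struct's half-precision format ('e') to get FP16 rounding.
        out.append(struct.unpack("<e", struct.pack("<e", value))[0])
    return out

weights = bytes([0x38, 0x7E, 0xB8])  # 1.0, 448.0 (largest finite E4M3 value), -1.0
print(dequantize_to_fp16(weights, scale=0.5))
```

On hardware with native FP8 (like the H100) this decode step is not needed, which is why the two machines can behave so differently on the same benchmark.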