It took me a while to understand that it uses 128-bit numbers internally. The `>> 64` in the pseudocode was super confusing until I saw the C++ code.

Neat code though!

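Not really. It looks like that in the C code, but in the generated machine code it'll typically be a single `MULH`-style instruction (`UMULH` on AArch64, `MULHU` on RISC-V) giving only the upper 64 bits of the product, no shift needed. On x86-64, a plain `MUL` leaves the high half in `RDX`, so there's no shift there either.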
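For anyone else who tripped over the `>> 64`, here's a minimal sketch of the pattern in C++. The name `mulhi64` is mine, not from the original code, and `unsigned __int128` is a GCC/Clang extension rather than standard C++, so treat this as an illustration, not the actual implementation:

```cpp
#include <cstdint>

// Hypothetical helper: returns the upper 64 bits of the full
// 128-bit product a * b. The widening cast and ">> 64" exist only
// at the source level; an optimizing compiler can usually lower
// this to a single multiply-high instruction.
// Assumes GCC or Clang (unsigned __int128 is a compiler extension).
inline std::uint64_t mulhi64(std::uint64_t a, std::uint64_t b) {
    unsigned __int128 product = static_cast<unsigned __int128>(a) * b;
    return static_cast<std::uint64_t>(product >> 64);
}
```

With optimizations on, checking the disassembly of something like this should show whether your compiler and target actually fold the shift away.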