GPUs are different from CPUs.

They’re way more efficient at matmuls, but throw data-dependent branching logic at them and they can slow down a lot.

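Literally some of the threads will no-op while others execute a branch. Threads within a warp (32 threads on NVIDIA GPUs) run in lockstep, so when they disagree on a branch, the hardware runs each path in turn and masks off the threads that didn't take it.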
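To make the cost concrete, here's a toy Python model of one 32-thread warp running an if/else under an active mask. It's a simplification, not how any real GPU is specified: `run_branch` and its cost parameters are made up for illustration, and it ignores things like the compiler predicating short branches and scheduler details.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def run_branch(conds, then_cost, else_cost):
    """Toy model of one warp executing `if cond: A else: B` in lockstep.

    conds: one bool per lane (thread) in the warp.
    then_cost, else_cost: hypothetical instruction counts for each path (> 0).
    Returns (cycles, utilization), where utilization is the fraction of
    lane-cycles spent doing useful work rather than sitting masked off.
    """
    lanes = len(conds)
    taken = sum(conds)
    not_taken = lanes - taken
    cycles = 0
    useful = 0
    # Path A is issued if any lane needs it; lanes with cond False are masked off (no-op).
    if taken:
        cycles += then_cost
        useful += taken * then_cost
    # Path B is issued afterwards if any lane needs it; now the other lanes idle.
    if not_taken:
        cycles += else_cost
        useful += not_taken * else_cost
    return cycles, useful / (cycles * lanes)


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE                   # every lane agrees
    half = [i % 2 == 0 for i in range(WARP_SIZE)]  # 16/16 split
    one_off = [True] * (WARP_SIZE - 1) + [False]   # a single lane disagrees
    for name, conds in [("uniform", uniform), ("half", half), ("one_off", one_off)]:
        print(name, run_branch(conds, then_cost=10, else_cost=10))
    # prints:
    # uniform (10, 1.0)
    # half (20, 0.5)
    # one_off (20, 0.5)
```

Note the last case: a single thread going the other way costs as much as a 50/50 split in this model, because the warp still has to issue both paths.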