GPUs are different from CPUs.
They’re way more efficient at matmuls, but start throwing data-dependent branching logic at them and their throughput drops a lot.
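Threads run in lockstep groups (warps of 32 on NVIDIA hardware), so when threads in the same group take different sides of a branch, the hardware runs each side in turn and the threads that aren't on that side literally sit idle until it's done.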
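
To put rough numbers on that, here is a toy Python model of one lockstep group hitting an if/else. It is only a sketch under simplifying assumptions: the function names (`warp_cycles`, `lane_utilization`) and the cycle costs are made up for illustration, and real hardware has more going on (scheduling, memory latency, predication) than this captures.

```python
WARP_SIZE = 32  # lanes per lockstep group; 32 matches NVIDIA warps (assumption for this sketch)


def warp_cycles(takes_branch, cost_if, cost_else):
    """Cycles one group spends on an if/else in a simplified lockstep model.

    takes_branch: list of bools, one per lane.
    If every lane agrees, only that side runs. If lanes disagree, both
    sides run back to back, with non-participating lanes masked off.
    """
    cycles = 0
    if any(takes_branch):        # at least one lane needs the "if" side
        cycles += cost_if
    if not all(takes_branch):    # at least one lane needs the "else" side
        cycles += cost_else
    return cycles


def lane_utilization(takes_branch, cost_if, cost_else):
    """Fraction of lane-cycles that do useful work (1.0 means no idle lanes)."""
    n_if = sum(takes_branch)
    n_else = len(takes_branch) - n_if
    useful = n_if * cost_if + n_else * cost_else
    total = len(takes_branch) * warp_cycles(takes_branch, cost_if, cost_else)
    return useful / total if total else 1.0


uniform = [True] * WARP_SIZE                      # every lane takes the same side
split = [i % 2 == 0 for i in range(WARP_SIZE)]    # lanes alternate sides
stray = [True] + [False] * (WARP_SIZE - 1)        # a single lane disagrees

print(warp_cycles(uniform, 10, 10), lane_utilization(uniform, 10, 10))  # 10 1.0
print(warp_cycles(split, 10, 10), lane_utilization(split, 10, 10))      # 20 0.5
print(warp_cycles(stray, 10, 10), lane_utilization(stray, 10, 10))      # 20 0.5
```

Note that in this model a single stray lane costs as much as an even split: once the group diverges at all, both sides have to run.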