Could be that the CUDA backend has seen far more specialized optimization, whereas the seemingly fairly fresh HIP backend hasn't had as many developers looking at it. In the end, the few extra control instructions on the CPU side needed to go through the ZLUDA wrapper will be insignificant compared to all the time spent inside better-optimized GPU kernels.