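A basic-block simulator such as llvm-mca is unlikely to give useful information here, because memory access is going to play a significant part in the overall performance. llvm-mca does not model the cache hierarchy (it assumes loads hit the L1 cache), so its throughput estimate says little about code that is limited by cache misses.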
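To make the point concrete, here is a minimal sketch in C, not taken from the code under discussion. The function names (`build_sequential_cycle`, `build_random_cycle`, `chase`) and the small PRNG are hypothetical and exist only for illustration. The loop in `chase` compiles to the same instructions whichever table it is given, so a static analysis of that loop body sees identical input in both cases, yet the run time depends heavily on how the table is laid out in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift PRNG so the sketch does not depend on RAND_MAX.
   The state must be nonzero. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Fill next[] so that following it visits the slots in index order
   and wraps around: 0 -> 1 -> ... -> n-1 -> 0. */
void build_sequential_cycle(size_t *next, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        next[i] = (i + 1) % n;
}

/* Fill next[] with one cycle through all n slots in shuffled order
   (Sattolo's algorithm: swap slot i with a slot strictly below it). */
void build_random_cycle(size_t *next, size_t n, uint64_t seed)
{
    assert(n > 0);
    uint64_t state = seed ? seed : 1;   /* xorshift needs a nonzero state */
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&state) % i);   /* j in [0, i) */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* The loop under study: each iteration is one load whose address
   depends on the result of the previous load. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t i = start;
    while (steps--)
        i = next[i];
    return i;
}
```

To see the effect, allocate a table much larger than the last-level cache, time `chase` over each layout (for example with `clock()` from `<time.h>`), and use the return value so the compiler cannot drop the loop. The shuffled layout will typically be much slower, and a hardware-counter profiler such as `perf` is the kind of tool that can show why.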