It might help to run the code through llvm-mca or a similar tool to get an idea of the potential speedup.

A basic-block simulator like llvm-mca is unlikely to give useful information here: memory access is going to play a significant part in the overall performance, and llvm-mca does not model the cache hierarchy.
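To make that concrete, here is a minimal sketch in C (not code from this thread; `gather_sum`, `fill_sequential`, `fill_shuffled` and `same_result` are hypothetical names made up for illustration). The loop in `gather_sum` compiles to the same instruction sequence whether the indices are sequential or shuffled, so a static basic-block analysis sees one and the same block in both cases. As far as I know llvm-mca does not try to predict cache hits or misses, so any difference in run time between the two index orders would be invisible to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical kernel: the loop body is the same instruction sequence
   no matter what idx contains. Only the memory access pattern changes. */
uint64_t gather_sum(const uint32_t *data, const uint32_t *idx, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* Sequential indices: consecutive loads touch consecutive addresses. */
void fill_sequential(uint32_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Shuffled indices: Fisher-Yates shuffle driven by a small LCG with a
   fixed seed (an arbitrary choice for this sketch), so consecutive
   loads land on scattered addresses. */
void fill_shuffled(uint32_t *idx, size_t n) {
    fill_sequential(idx, n);
    uint64_t state = 12345;
    for (size_t i = n; i > 1; i--) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        size_t j = (size_t)((state >> 33) % i);
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Returns 1 if both index orders produce the same sum, i.e. the two
   runs do identical arithmetic and differ only in access order. */
int same_result(size_t n) {
    if (n == 0)
        return 1;
    uint32_t *data = malloc(n * sizeof *data);
    uint32_t *idx = malloc(n * sizeof *idx);
    assert(data != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        data[i] = (uint32_t)(i & 0xff);

    fill_sequential(idx, n);
    uint64_t a = gather_sum(data, idx, n);
    fill_shuffled(idx, n);
    uint64_t b = gather_sum(data, idx, n);

    free(data);
    free(idx);
    return a == b;
}
```

Timing `gather_sum` with each index order on an array much larger than the last-level cache would be the way to see how far apart the two cases are on a given machine; that has to be measured, since the static estimate is the same for both.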