Indeed, recent versions of Flash Attention are a pain point for non-CUDA platforms.