On my gfx1030 "consumer grade hardware", ROCm uses SDMA, and SDMA is broken on my system. Forcing `HSA_ENABLE_SDMA=0` makes it "work", but loading tensors to VRAM then takes 15x longer.
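
A rough way to compare the two settings is to time a host-to-device copy under each one. The sketch below assumes a ROCm build of PyTorch, which is an assumption since the comment above does not name the framework. `measure_h2d_gibps` is a hypothetical helper name, and the buffer size and repeat count are arbitrary. The variable is set before `torch` is imported because the runtime reads it at initialisation.

```python
import os
import time

# Assumption: the ROCm runtime reads this at start-up, so set it before
# importing torch. setdefault keeps any value already given on the command line.
os.environ.setdefault("HSA_ENABLE_SDMA", "0")


def measure_h2d_gibps(size_mib=256, repeats=5):
    """Return host-to-device copy throughput in GiB/s.

    Returns None when PyTorch or a GPU is not available.
    Hypothetical helper for illustration only.
    """
    try:
        import torch  # assumption: ROCm build, which exposes the torch.cuda API
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    host = torch.empty(size_mib * 1024 * 1024, dtype=torch.uint8)
    host.to("cuda")  # warm-up copy so one-time initialisation is not timed
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        host.to("cuda")
    torch.cuda.synchronize()  # wait for the copies to finish before stopping the clock
    elapsed = time.perf_counter() - start

    return (size_mib / 1024) * repeats / elapsed


if __name__ == "__main__":
    print("HSA_ENABLE_SDMA =", os.environ["HSA_ENABLE_SDMA"])
    print("host-to-device GiB/s:", measure_h2d_gibps())
```

Running the script once as-is and once with `HSA_ENABLE_SDMA=1` set in the shell gives two throughput figures to compare. On a system where SDMA is broken the second run may hang or fail instead of printing a number.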