I didn't check that your code works. I just copied it and ran the inner part of the loop once, but according to https://www.masswerk.at/6502/
it's about 2x faster. Your code uses 44 CPU cycles per iteration, x 64.
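Edit: plus the branch instruction at the end of the loop, which I guess adds about 3 cycles x 64 (a taken branch on the 6502 is 3 cycles when it doesn't cross a page boundary).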
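
For what it's worth, here is the arithmetic written out as a quick Python sketch. The 44 and 64 are just the numbers quoted above, not re-measured. The branch timings assume a normal conditional branch (something like BNE) that doesn't cross a page boundary: 3 cycles when taken, 2 when it falls through on the last pass. The names are made up for illustration.

```python
# Rough cycle estimate for the loop. All inputs are assumptions taken
# from the figures quoted in this comment, not fresh measurements.
BODY_CYCLES = 44       # one pass through the loop body (from the emulator run)
ITERATIONS = 64        # number of passes, the "x 64" above
BRANCH_TAKEN = 3       # 6502 conditional branch, taken, no page crossing
BRANCH_NOT_TAKEN = 2   # the final branch, which falls through


def loop_cycles(body=BODY_CYCLES, n=ITERATIONS):
    """Total cycles: n bodies, n-1 taken branches, 1 not-taken branch."""
    return body * n + BRANCH_TAKEN * (n - 1) + BRANCH_NOT_TAKEN


body_only = BODY_CYCLES * ITERATIONS                     # loop body alone
upper_guess = (BODY_CYCLES + BRANCH_TAKEN) * ITERATIONS  # "3 cycles x 64" guess

print(body_only, upper_guess, loop_cycles())
```

So the branch adds roughly 7% on top of the body, and the "3 x 64" guess is only one cycle off from the stricter count.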