Good analysis. I guess I've lost my cycle-counting fu, because on modern CPUs cycle counting is a) infeasible, what with superscalar, speculative, out-of-order execution, and b) hardly worth it even if you did manage it: a CPU cycle is less than a nanosecond, but any memory access that slips through the caches costs hundreds of cycles.

Of course none of that applies here, but it colors the way you think about things…

Meanwhile the 6502 is all "cache? what's cache?", and every 6502 optimization is basically a pessimization for modern CPUs.