Yeah, it might be about isolating the APU as much as possible from potential sources of noise. Not that they ever put optical isolators on the data lines between the APU and CPU, but just keeping them out of phase probably helped a lot.

Another bit of evidence for that: while they merged all the audio chips into a single S-APU chip, and both PPUs and the CPU into the 1CHIP, they never took the final step of merging the APU, PPU and CPU into a single chip. And they never shrank the PCB to move the two chips closer together.

------------

My other theory is that if the audio clock were derived from the video clock, it would run at a different sample rate on NTSC and PAL consoles, because the two regions use different master clocks. By giving the APU an independent crystal, they could make sure both models have the same audio sample rate.
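To put rough numbers on that, here's a quick sketch. The master clock values and the 24.576 MHz / 768 = 32 kHz relationship are the commonly cited nominal figures; `HYPOTHETICAL_DIVIDER` is a made-up value I picked only so the result lands near 32 kHz, since no such circuit ever existed in the console.

```python
# Nominal clock figures for the SNES, in Hz.
NTSC_MASTER_HZ = 21_477_272   # NTSC master clock (6 x NTSC colorburst)
PAL_MASTER_HZ = 21_281_370    # PAL master clock (4.8 x PAL colorburst)
APU_CRYSTAL_HZ = 24_576_000   # independent audio clock source
APU_DIVIDER = 768             # 24.576 MHz / 768 = 32 kHz nominal

# HYPOTHETICAL: an integer divider a single-crystal design might have used
# to get close to 32 kHz from the video master clock. Not a real value.
HYPOTHETICAL_DIVIDER = 671


def sample_rate(clock_hz, divider):
    """Sample rate obtained by dividing a clock by a fixed integer."""
    return clock_hz / divider


# What the real design gives: identical on both regions.
independent = sample_rate(APU_CRYSTAL_HZ, APU_DIVIDER)

# What a shared-crystal design would give: depends on the region.
ntsc_derived = sample_rate(NTSC_MASTER_HZ, HYPOTHETICAL_DIVIDER)
pal_derived = sample_rate(PAL_MASTER_HZ, HYPOTHETICAL_DIVIDER)

# Ratio of PAL to NTSC playback speed (and pitch) under that design.
pitch_ratio = pal_derived / ntsc_derived

if __name__ == "__main__":
    print(f"independent crystal: {independent:.2f} Hz on both regions")
    print(f"derived, NTSC:       {ntsc_derived:.2f} Hz")
    print(f"derived, PAL:        {pal_derived:.2f} Hz")
    print(f"PAL/NTSC ratio:      {pitch_ratio:.5f}")
```

Whatever divider you pick, the PAL/NTSC ratio is just the ratio of the two master clocks, so PAL audio would come out roughly 0.9% slower and flatter unless every game shipped region-specific sample data.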

It's probably a combination of many of these small factors that kept them from ever going to the effort of making it work from a single crystal.