Perhaps it's a legacy of the sound hardware being designed by Sony instead of Nintendo? Designed to be self-contained and not reliant on other parts of the system? On early SNES models the sound circuitry even sits on its own sub-PCB in a metal box (although seemingly not the crystal, for some reason). Plus there's the fact that it has its own processor to run sound, instead of using the main 65816 (though sound sub-CPUs aren’t unknown in consoles; see the Mega Drive's Z80).

Or someone just really cared about sound quality (see also: the metal box).

Yeah, it might be about isolating the APU as much as possible from potential sources of noise. Not that they ever put optical isolators on the data lines between the APU and CPU, but just keeping them out of phase probably helped a lot.

Another bit of evidence for that: while they merged all the audio chips into a single S-APU chip, and both PPUs and the CPU into the 1CHIP, they never took the final step of merging the APU, PPUs and CPU into a single chip. And they never shrank the PCB to move the two chips closer together.

------------

My other theory is that if the audio clock were derived from the video clock, then it would have a different sample rate on NTSC and PAL consoles; by giving it an independent crystal, they could make sure both models have the same audio sample rate.
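
To put rough numbers on that, here's a quick back-of-the-envelope sketch. It uses the commonly cited master clock figures (about 21.477272 MHz for NTSC and 21.281370 MHz for PAL) and the 24.576 MHz audio crystal, which divides by 768 to give the 32 kHz output rate. The divider of 671 is purely hypothetical, picked only because it lands near 32 kHz on NTSC; nothing suggests Nintendo or Sony ever considered it.

```python
# Back-of-the-envelope arithmetic only; not a description of real hardware.
NTSC_MASTER_HZ = 21_477_272   # commonly cited NTSC master clock
PAL_MASTER_HZ = 21_281_370    # commonly cited PAL master clock
APU_CRYSTAL_HZ = 24_576_000   # independent audio crystal
APU_DIVIDER = 768             # 24.576 MHz / 768 = 32 kHz

# Hypothetical divider, chosen only to land near 32 kHz on NTSC.
HYPOTHETICAL_DIVIDER = 671


def sample_rate(clock_hz, divider):
    """Sample rate produced by dividing a clock by a fixed integer."""
    return clock_hz / divider


def derived_rates(divider=HYPOTHETICAL_DIVIDER):
    """Sample rates if audio were clocked from each region's master clock."""
    return {
        "NTSC": sample_rate(NTSC_MASTER_HZ, divider),
        "PAL": sample_rate(PAL_MASTER_HZ, divider),
    }


def mismatch_percent():
    """How much slower PAL audio would run relative to NTSC, in percent."""
    return (1 - PAL_MASTER_HZ / NTSC_MASTER_HZ) * 100


if __name__ == "__main__":
    print("Independent crystal:", sample_rate(APU_CRYSTAL_HZ, APU_DIVIDER), "Hz")
    for region, rate in derived_rates().items():
        print(f"Derived from {region} master clock: {rate:.2f} Hz")
    print(f"PAL vs NTSC mismatch: {mismatch_percent():.2f}%")
```

Whatever divider you pick, the two regions end up a bit under 1% apart, so the same music data would play slightly slower and flatter on PAL. With the separate crystal, both regions get exactly 32 kHz.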

It's probably a combination of many of these small factors that kept them from ever going to the effort of trying to make it work from a single crystal.