I've been doing audio software for 25-30 years. I have no idea what sort of synthesis you'd be doing where the processor clock played any role at all. Waveform synthesis is normally done in buffers (8 to 8192 samples), and the "clocking" to convert the sample stream into an analog waveform is done by the audio interface/DAC, not the CPU. If you were basically implementing a DAC, then yes, the clock would matter a lot ... is/was that the issue?
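To make the buffer point concrete, here is a minimal sketch of block-based synthesis. The sample rate, block size, and the `render_block` helper are all assumptions picked for illustration, not any particular audio API. Note that nothing in it depends on how fast the CPU runs; the only clock that matters is the sample rate the interface plays the buffers back at.

```python
import math

SAMPLE_RATE = 48_000  # Hz; assumed value, set by the audio interface, not the CPU
BLOCK_SIZE = 256      # samples per buffer; assumed, within the 8 to 8192 range

def render_block(freq_hz, phase, n=BLOCK_SIZE, rate=SAMPLE_RATE):
    """Fill one buffer with a sine wave and return (samples, next_phase)."""
    step = 2.0 * math.pi * freq_hz / rate  # phase advance per sample
    samples = [math.sin(phase + i * step) for i in range(n)]
    # Carry the phase into the next call so consecutive buffers join smoothly.
    next_phase = (phase + n * step) % (2.0 * math.pi)
    return samples, next_phase

phase = 0.0
for _ in range(3):
    block, phase = render_block(440.0, phase)
    # A real program would hand `block` to the audio interface here.
```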

You've not done it long enough to have worked with machine language programs that used instruction timing to click a speaker.

This worked well on 1980s microcomputers, which used an accurate crystal oscillator clock. ICs like the MOS 6502 or Intel 8086 don't have a built-in clock source. The boards were large and costly enough to afford a crystal, and it was often dual-purposed. E.g. in Apple II machines, the master oscillator from which the NTSC colorburst clock was derived also supplied the CPU clock.

These processors had no caches, so instructions executed with predictable timing. Every data access or instruction fetch was a real cycle on the bus, taking the same time every time.

Code that arranged not to be interrupted could generate precise signals.
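The arithmetic behind that is simple enough to sketch. This assumes a hypothetical 1 MHz CPU and a loop that toggles the speaker once per half period; both function names are made up for illustration.

```python
def half_period_cycles(cpu_hz, tone_hz):
    """CPU cycles to spend between speaker toggles for a square wave at tone_hz."""
    return round(cpu_hz / (2 * tone_hz))

def actual_tone_hz(cpu_hz, cycles):
    """Frequency actually produced when the speaker is toggled every `cycles` cycles."""
    return cpu_hz / (2 * cycles)

CPU_HZ = 1_000_000  # assumed round figure, not the clock of any specific machine

cycles = half_period_cycles(CPU_HZ, 440.0)  # 1136 cycles per half period
tone = actual_tone_hz(CPU_HZ, cycles)       # about 440.14 Hz
```

The produced pitch is only as good as `CPU_HZ`, and the loop has to be padded to exactly `cycles` cycles, which is only possible when every instruction takes a known number of cycles.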

Some microcomputers used software loops to drive serial lines, lacking a UART chip for that. You could do that well enough to communicate up to around 1200 baud.
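Roughly what such a loop has to do, as a sketch: the 8N1 framing (one start bit, eight data bits, one stop bit) and the 1 MHz clock are my assumptions, and the function names are hypothetical.

```python
def frame_byte(value):
    """Line levels for one byte in 8N1 framing: start bit, 8 data bits LSB first, stop bit."""
    data = [(value >> i) & 1 for i in range(8)]
    return [0] + data + [1]  # start bit is low, stop bit is high

def cycles_per_bit(cpu_hz, baud):
    """CPU cycles the loop must burn while holding each line level."""
    return cpu_hz / baud

bits = frame_byte(ord("A"))              # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
budget = cycles_per_bit(1_000_000, 1200)  # about 833 cycles per bit
```

At 1200 baud on an assumed 1 MHz CPU that leaves about 833 cycles per bit, which is comfortable. The budget shrinks in proportion as the baud rate goes up.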

As in the Manic Miner soundtrack on the ZX Spectrum: https://cirrusretro.com/listen/5333-manic-miner-zx-spectrum (warning: loud and annoying!)

> you built what was basically a raspberry pi with a microcontroller by hand, and you had to use the dumb speaker and controller to make your own music firmware to produce notes

This sounds like they were most likely bit-banging square waves into a speaker directly via a GPIO on a microcontroller (or maybe using a PWM output if they were fancy about it). In that case, the audio frequency is derived directly from the microcontroller's clock speed, and the tolerance of an internal oscillator on a microcontroller can be as bad as 10%.
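To put a number on how audible that is: pitch scales linearly with the clock, so a fractional clock error converts directly to cents (100 cents is one semitone). A small sketch, with `cents_error` being a name I made up:

```python
import math

def cents_error(clock_error_fraction):
    """Pitch error in cents when the clock is off by the given fraction (0.10 means 10% fast)."""
    return 1200.0 * math.log2(1.0 + clock_error_fraction)

sharp = cents_error(0.10)   # about +165 cents
flat = cents_error(-0.10)   # about -182 cents
```

So at the bad end of that tolerance, every note lands more than a semitone and a half away from where the firmware thinks it is. The intervals between notes stay correct, since everything shifts by the same ratio.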

yes it was this