The Mel-scale spectrogram doesn't do anything specific to reduce noise relative to a plain FFT-based spectrogram. It's just preferred for traditional speech recognition because it uses a non-linear frequency scale that better matches human perception.
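A small sketch of what that non-linear scale looks like: equal steps on the Mel scale correspond to progressively wider steps in Hz, so a Mel filterbank gives finer resolution at low frequencies, where formant detail sits. This uses the common HTK-style conversion formula (the constants 2595 and 700 are the standard convention, not something from the text above):

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style conversion: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Ten equal steps on the Mel scale from 0 to 8 kHz: the corresponding
# spacing in Hz grows from a couple of hundred Hz at the bottom to
# nearly 2 kHz at the top, i.e. finer resolution down low.
mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11)
hz_points = mel_to_hz(mel_points)
print(np.round(hz_points))
```

A Mel spectrogram is then just an FFT spectrogram whose bins have been pooled into triangular filters centred at points like these.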

Speech recognition is based on recognizing the frequency correlates of speech generation/articulation, mainly the frequency bands that are emphasized by vocal tract resonances as articulation changes the shape of the vocal tract.

The fundamental frequency, f0, of someone's voice is not important to speech recognition - that is just the frequency at which their vocal cords open and close, corresponding to a high-pitched voice (e.g. typical female or child) vs a low-pitched one (typical male).

What happens during speech production is that, due to the complex waveform generated by the asymmetrically timed opening and closing of the vocal cords (slow open, fast close), not only is the fundamental frequency, f0, generated, but also its harmonics - 2xf0, 3xf0, 4xf0, etc. The resonances of the vocal tract then emphasize certain frequency ranges within this spectrum of frequencies, and it's these changing emphasized frequency ranges, aka formants, that effectively carry the articulation/speech information.
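The harmonic structure is easy to see with a toy source signal: a train of impulses at f0 (a crude stand-in for glottal pulses - the sample rate and f0 below are illustrative choices, not values from the text) has spectral energy at f0, 2xf0, 3xf0, ... and essentially nothing in between:

```python
import numpy as np

fs = 16000                 # sample rate in Hz (illustrative)
f0 = 200                   # fundamental frequency of the pulse train
x = np.zeros(fs)           # one second of signal
x[::fs // f0] = 1.0        # an impulse every 1/f0 seconds; a real glottal
                           # pulse is asymmetric (slow open, fast close),
                           # but it has the same harmonic line structure

spectrum = np.abs(np.fft.rfft(x))   # 1 Hz per bin for a 1-second signal

# Energy sits only at f0 and its integer multiples:
print(spectrum[f0], spectrum[2 * f0], spectrum[3 * f0])  # large
print(spectrum[f0 + 100])                                # ~0, between harmonics
```

The vocal tract then acts as a filter over this comb of harmonics, boosting the ones that fall near its resonances.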

The frequency ranges of the formants also vary with the length of the vocal tract, which differs between individuals, so it's not specific frequencies such as f0 or its harmonics that carry the speech information, but rather the changing patterns of resonance (formants).
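A toy source-filter sketch of that separation: drive the same second-order resonator (a stand-in for one vocal-tract resonance; the 800 Hz "formant" and the filter parameters are invented for illustration) with pulse trains at two different pitches. The loudest harmonic lands near the resonance in both cases - the filter, not f0, determines where the spectral peak sits:

```python
import numpy as np

fs = 16000  # sample rate in Hz (illustrative)

def resonator(x, f_res, r=0.98):
    """Second-order all-pole filter resonant near f_res Hz (a toy 'formant')."""
    a1 = 2.0 * r * np.cos(2.0 * np.pi * f_res / fs)
    a2 = -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + a1 * (y[n - 1] if n > 0 else 0.0) \
                    + a2 * (y[n - 2] if n > 1 else 0.0)
    return y

# Same "vocal tract" (filter), two different pitches: the strongest
# harmonic is the one nearest the 800 Hz resonance either way.
for f0 in (100, 200):
    x = np.zeros(fs)
    x[::fs // f0] = 1.0                # glottal-like pulse train at f0
    y = resonator(x, f_res=800.0)
    spec = np.abs(np.fft.rfft(y))      # 1 Hz per bin
    harmonics = np.arange(f0, 3000, f0)
    loudest = harmonics[np.argmax(spec[harmonics])]
    print(f0, loudest)                 # loudest harmonic sits near f_res
```

This is essentially why recognizers track formant patterns rather than absolute harmonic frequencies.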