I've never worked in that area, but I recall reading that images of spectrograms are often superior inputs to neural nets compared with the raw audio data.
Raw audio data is unnatural. The ear doesn't capture pressure samples thousands of times per second; it captures frequencies and the sonic energy they carry. The result of computing a spectrogram on the raw data is what comes raw out of our biological sensor.
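For anyone curious, a minimal sketch of that transformation with scipy (the file name, sample rate and window sizes are just placeholder choices):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    # Raw PCM: thousands of pressure samples per second.
    fs, samples = wavfile.read("speech.wav")        # e.g. fs = 16000
    if samples.ndim > 1:                            # mix down if stereo
        samples = samples.mean(axis=1)

    # Short-time Fourier analysis: energy per frequency band over time,
    # which is much closer to what the cochlea actually reports.
    freqs, times, power = spectrogram(samples, fs=fs, nperseg=512, noverlap=256)

    print(samples.shape)   # (n_samples,)            raw pressure samples
    print(power.shape)     # (n_freq_bins, n_frames) frequency x time energy grid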
I'm breaking the commenting rules to say this, but this strikes me as a valuable insight. Thanks!
Speech to text and text to speech typically operate on the audio spectrogram, specifically the Mel-scale spectrum. This is a filtered spectrogram that decreases the noise in the data. Thus, they are not working on images of these spectra but on the computed values -- each spectral slice is a row or column of a matrix of values.
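Concretely, the model input is a plain 2-D array of mel energies, not a rendered picture. A minimal sketch with librosa (the file name, band count and hop size are arbitrary choices of mine):

    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)

    # 80 mel bands is a common choice; each column is one spectral slice.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    mel_db = librosa.power_to_db(mel)   # log scale, as models usually expect

    print(mel_db.shape)   # (80, n_frames) -- a matrix of values, not pixels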
The theory is that vowels and voiced consonants have a fundamental frequency and 5-6 resonant frequencies (formants) above it. For vowels, the first two formants are enough to identify the vowel. For rhotic vowels (r-sounding vowels, like American stARt), the third formant is important.
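A toy illustration of "the first two formants are enough", using rough textbook-average (F1, F2) values for a few American English vowels - the exact numbers vary a lot by speaker and study, so treat them as placeholders:

    # Very rough average (F1, F2) in Hz for a male speaker,
    # in the spirit of Peterson & Barney-style tables; illustrative only.
    VOWELS = {
        "i (beet)":   (270, 2290),
        "ae (bat)":   (660, 1720),
        "a (father)": (730, 1090),
        "u (boot)":   (300,  870),
    }

    def closest_vowel(f1, f2):
        # Nearest neighbour in the (F1, F2) plane.
        return min(VOWELS, key=lambda v: (VOWELS[v][0] - f1) ** 2 +
                                         (VOWELS[v][1] - f2) ** 2)

    print(closest_vowel(290, 2200))   # -> "i (beet)"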
Converting the audio to the Mel-scale spectrum makes these features easier to detect. Text to speech using the Mel-spectrum works by modelling and generating these values, which is often easier since the number of parameters is lower and the data is easier to work with [1].
[1] There are other approaches to text to speech such as overlapping short audio segments.
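To make the "fewer parameters" point concrete, a back-of-the-envelope comparison (the sample rate, hop size and mel count are just typical values I picked, not anything standardized):

    sample_rate = 22050      # raw audio samples per second
    hop_length  = 256        # audio samples between successive mel frames
    n_mels      = 80         # mel bands per frame

    frames_per_second = sample_rate / hop_length         # ~86 frames
    mel_values_per_second = frames_per_second * n_mels   # ~6900 values

    print(sample_rate, "raw samples/s vs", round(mel_values_per_second), "mel values/s")
    # The mel representation has roughly 3x fewer numbers per second, and each
    # one changes far more smoothly than a raw waveform sample.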
The Mel-scale spectrogram doesn't do anything specific to reduce noise compared to an FFT. It's just preferred for traditional speech recognition because it uses a non-linear frequency scale that better matches human perception.
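For reference, the usual (HTK-style) mel mapping is just a log-like compression of the frequency axis:

    import numpy as np

    def hz_to_mel(f_hz):
        # Common HTK-style formula: roughly linear below ~1 kHz, logarithmic above.
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Equal steps in mel are small steps in Hz at low frequencies and large
    # steps at high frequencies -- matching perception, not removing noise.
    for f in (100, 500, 1000, 4000, 8000):
        print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")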
Speech recognition is based around recognizing the frequency correlates of speech generation/articulation, mainly the frequency bands that are emphasized by vocal tract resonances as articulation changes the shape of the vocal tract.
The fundamental frequency, f0, of someone's voice is not important to speech recognition - that is just the frequency with which their vocal cords are opening and closing, corresponding to a high-pitched voice (e.g. a typical female or child) vs a low-pitched one (male).
What happens during speech production is that due to the complex waveform generated by the asymmetrically timed opening and closing of the vocal cords (slow open, fast close), not only is the fundamental frequency, f0, generated, but also harmonics of it - 2xf0, 3xf0, 4xf0, etc. The resonances of the vocal tract then emphasize certain frequency ranges within this spectrum of frequencies, and it's these changing emphasized frequency ranges, aka formants, that effectively carry the articulation/speech information.
The frequency ranges of the formants also vary according to the length of the vocal tract, which varies between individuals, so it's not specific frequencies such as f0 or its harmonics that carry the speech information, but rather changing patterns of resonance (formants).
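A crude source-filter sketch of the mechanism described above, in Python with scipy (f0 and the formant frequencies/bandwidths are illustrative values, not anything definitive):

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    f0 = 120                        # vocal-fold open/close rate, i.e. pitch
    source = np.zeros(fs)           # one second of "glottal" source
    source[::int(fs / f0)] = 1.0    # impulse train: spectrum has f0, 2*f0, 3*f0, ...

    def resonator(x, freq, bandwidth, fs):
        # Simple two-pole resonator standing in for one vocal-tract resonance.
        r = np.exp(-np.pi * bandwidth / fs)
        theta = 2 * np.pi * freq / fs
        b, a = [1 - r], [1.0, -2 * r * np.cos(theta), r * r]
        return lfilter(b, a, x)

    # Shifting these centre frequencies (formants) is what changes the vowel;
    # the values below are rough /a/-like ones for a male vocal tract.
    speech_like = source
    for formant, bw in [(730, 90), (1090, 110), (2440, 160)]:
        speech_like = resonator(speech_like, formant, bw, fs)
    # speech_like now has energy at f0 and its harmonics, shaped by formant peaks.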
I used to work with the guy who solved voice recognition for Google, and he said that to improve the quality significantly he spent so much time looking at spectrograms of speech that he could just glance at a spectrogram and perceive what was being said.