Raw audio data is unnatural. The ear doesn't capture pressure samples thousands of times per second. It captures frequencies and the sonic energy they carry. The result of computing a spectrogram of the raw data is what comes raw out of our biological sensor.
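For what it's worth, here is a rough sketch of what "doing a spectrogram" means in practice, assuming NumPy and a synthetic 1 kHz tone in place of real audio. The function name and the `frame_len` and `hop` values are only illustrative choices, not a recommendation:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Power spectrogram via a short-time Fourier transform (illustrative only)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the pressure samples into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Energy per frequency bin, per frame: shape (n_frames, frame_len // 2 + 1).
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# Assumed test input: one second of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256  # bin index -> frequency in Hz
```

The raw input is 8000 pressure samples, while each row of `spec` says how much energy sits at each frequency during one short time slice, which is the representation I'm comparing to the ear's output.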
I'm breaking the commenting rules to say this, but this strikes me as a valuable insight. Thanks!