Yeah, it's sort of like saying the ear doesn't do "a" Fourier transform; it does a bunch of Fourier transforms on short windows of data, with a varying tradeoff between temporal and frequency resolution. But most people would still say that's doing a Fourier transform.
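
For concreteness, that "bunch of Fourier transforms on windows" framing is just the short-time Fourier transform. Here's a minimal NumPy sketch (the sample rate, window lengths, and test signal are all made up for illustration): a short window pins down a 1 ms click in time but its ~62 Hz bins can't separate two tones 30 Hz apart, while a long window separates the tones but smears the click across a quarter-second frame.

```python
import numpy as np

fs = 16000                          # sample rate in Hz (arbitrary, for illustration)
t = np.arange(fs) / fs              # one second of "audio"
# Test signal: two tones 30 Hz apart, plus a 1 ms click at t = 0.5 s.
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 1030 * t)
x[8000:8016] += 5.0

def stft_frames(x, win_len, hop):
    """Plain short-time Fourier transform: window a chunk, FFT it, slide, repeat."""
    win = np.hanning(win_len)
    frames = [np.fft.rfft(x[s:s + win_len] * win)
              for s in range(0, len(x) - win_len, hop)]
    return np.array(frames)

for win_len in (256, 4096):
    S = stft_frames(x, win_len, hop=win_len // 2)
    # Frequency resolution is fs / win_len; temporal resolution is roughly win_len / fs.
    print(f"window {win_len:5d}: {S.shape[0]:3d} frames, "
          f"bin spacing {fs / win_len:6.2f} Hz, "
          f"frame length {1000 * win_len / fs:6.1f} ms")
```

A fixed win_len gives the same tradeoff at every frequency; the ear's "varying tradeoff" is more like running this with different effective window lengths in different frequency bands.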

As the article briefly mentions, it's a tempting hypothesis that there is a relationship between the acoustic properties of human speech and the physical/neural structure of the auditory system. It's hard to get clear evidence on this, but a lot of people have a hunch that there was some coevolution involved, with the ear's filter functions favoring the frequency ranges used by speech sounds.

> ...it's a tempting hypothesis that there is a relationship between the acoustic properties of human speech and the physical/neural structure of the auditory system.

This seems trivially true in the sense that human speech is intelligible to humans; there are many sounds that humans cannot hear or distinguish, and speech does not involve those.

Yes, but at the least it's a bit more than that: the ear is more sensitive to some frequency ranges than others, and speech sounds seem to cluster in those ranges.
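
As a rough illustration of the sensitivity half of that claim, the standard A-weighting curve (IEC 61672) is a crude approximation of the ear's relative sensitivity across frequency; evaluating it shows the familiar peak in the low kilohertz range, which roughly overlaps the band carrying most of speech's intelligibility cues. The sample frequencies below are arbitrary choices for illustration.

```python
import numpy as np

def a_weighting_db(f):
    """A-weighting (IEC 61672): a rough stand-in for the ear's relative
    sensitivity across frequency, normalised to 0 dB at 1 kHz."""
    f = np.asarray(f, dtype=float)
    ra = (12194.0**2 * f**4) / (
        (f**2 + 20.6**2)
        * np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
        * (f**2 + 12194.0**2)
    )
    return 20 * np.log10(ra) + 2.0

# Frequencies below, inside, and above the band where speech cues are densest.
freqs = np.array([50.0, 100.0, 300.0, 1000.0, 3000.0, 8000.0, 15000.0])
for f, a in zip(freqs, a_weighting_db(freqs)):
    print(f"{f:7.0f} Hz: {a:+6.1f} dB")
```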

This is something you quickly learn when you read the theory in a textbook, get excited, sit down to write some code, and discover that you'll have to pick a finite buffer size. :-)
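
For anyone who hasn't hit this yet: picking a finite buffer is itself a signal-processing decision. Truncating to N samples is equivalent to multiplying by a rectangular window, which smears a tone's energy across the spectrum unless you apply an explicit window. A tiny sketch (sample rate, buffer size, tone frequency, and the "far away" bin are all arbitrary choices for illustration):

```python
import numpy as np

fs = 48000                           # sample rate in Hz, assumed for the example
n = 1024                             # the finite buffer you end up having to pick
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 440.7 * t)    # a tone that doesn't line up with any FFT bin

def db(v):
    """Magnitude in decibels."""
    return 20 * np.log10(np.abs(v))

rect = np.fft.rfft(x)                    # truncation alone = rectangular window
hann = np.fft.rfft(x * np.hanning(n))    # an explicit window tames the far-off leakage

far_bin = 200                            # a bin far away from the 440.7 Hz tone
print(f"bin width: {fs / n:.1f} Hz")
print(f"spectrum level at ~{far_bin * fs / n / 1000:.1f} kHz, relative to the peak:")
print(f"  rectangular: {db(rect[far_bin]) - db(rect).max():7.1f} dB")
print(f"  Hann:        {db(hann[far_bin]) - db(hann).max():7.1f} dB")
```

And of course N also fixes the bin spacing fs / N and the latency N / fs, which is exactly the time/frequency tradeoff from the top of the thread.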