I lost a ton of respect for the author when he started talking about speech recognition.
He makes a few claims:
(1) That speech recognition isn't end-to-end because it requires highly sophisticated, mathematically crafted preprocessing.
(2) That this is evidence human learning is more sophisticated than deep learning.
So (1) is just nonsense. It was true 10 years ago but wasn't true 6 years ago. And if he's that far out of date, that really poisons my ability to trust him.
And (2) misses some important knowledge about how humans work, which most speech recognition researchers know about. The human ear actually does its own version of Fourier decomposition: hair cells at different positions along the cochlea respond to different frequencies. The human body does a ton of evolved preprocessing. Given that we developed in decades the kind of audio preprocessing that took evolution millions of years to build, we seem to be doing pretty well.
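For anyone who hasn't seen what this kind of preprocessing looks like, here is a minimal sketch of a Fourier decomposition of one audio frame. It is only an illustration: the sample rate, frame length, tone frequencies, and the `dft_magnitudes` helper are all made up for the example, and a real pipeline would typically use a windowed FFT followed by something like a mel filterbank rather than this naive O(n^2) transform.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive discrete Fourier transform of one frame.

    Returns the normalized magnitude of each frequency bin from 0 Hz
    up to the Nyquist frequency.
    """
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        mags.append(abs(acc) / n)
    return mags


# Hypothetical values chosen so both tones fall exactly on a bin.
SAMPLE_RATE = 800  # samples per second
FRAME_LEN = 200    # one 0.25 s frame, so each bin is 4 Hz wide

# A "sound" made of a 100 Hz tone plus a quieter 240 Hz tone.
frame = [
    math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 240 * t / SAMPLE_RATE)
    for t in range(FRAME_LEN)
]

mags = dft_magnitudes(frame)
bin_hz = SAMPLE_RATE / FRAME_LEN

# The two strongest bins recover the two tones in the mixture.
strongest = sorted(range(len(mags)), key=mags.__getitem__, reverse=True)[:2]
peaks_hz = sorted(k * bin_hz for k in strongest)
print(peaks_hz)
```

The point is just that the raw waveform gets turned into "how much energy is at each frequency", which is roughly the job the cochlea does mechanically before any neurons get involved.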
> [preprocessing] was true 10 years ago but wasn't true 6 years ago
Can you say more? What are some examples of speech recognition systems that don't need this preprocessing?