The current AI boom has more to do with NVIDIA, and the popularity of computer gaming giving us GPU compute, than who was using neural networks back in the 1990s.
More specifically, it was really AlexNet, the 2012 ImageNet entry, running on two NVIDIA GTX 580s, that highlighted the practicality and utility of running large-scale neural nets on affordable hardware. CUDA had been released in 2006, but cuDNN (the CUDA library for neural nets) didn't come out until 2014 - after AlexNet had already kickstarted the demand.
What followed from AlexNet was a few years of intense competition on the ImageNet benchmark, with ever larger and deeper neural nets (CNNs), which gave rise to a lot of the algorithms and concepts still used today, such as residual connections (originally from ResNet), Adam (the optimizer), ReLU and similar activations, normalization, dropout, etc.: all the fundamentals that made building large neural nets possible.
Schmidhuber continually reminding everyone that he was working on neural nets back in the 1990s is beyond tiresome. Yes, he should have been recognized alongside Hinton/Bengio/LeCun as one of the pioneers, but it's time for him to get over it.
I agree. I also think it's about the hardware and, obviously, recognizing AD (automatic differentiation) as the fundamental primitive.
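For anyone who hasn't seen AD spelled out, here is a minimal sketch of reverse-mode automatic differentiation on scalars. It is only an illustration: the `Var` class and `backward` function are hypothetical names made up for this example, not the API of any framework, and real systems work on tensors with many more operations.

```python
class Var:
    """Scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each entry is (parent Var, local derivative of self w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse sweep: accumulate d(output)/d(node) into every node.grad."""
    order, seen = [], set()

    def visit(node):
        # Depth-first post-order: inputs are listed before the nodes using them.
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x, w, b = Var(3.0), Var(2.0), Var(1.0)
y = w * x * x + b   # y = w * x^2 + b
backward(y)
print(y.value, x.grad, w.grad, b.grad)
```

The point is that the gradient code is generic: it never needs to know whether the graph came from a CNN, an LSTM, or a Transformer, which is why AD works as the primitive underneath all of them.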
Particular architectures don't matter so much yet. It's quite possible that S3-Mamba or xLSTM could be used in lieu of transformers and we would still have LLMs.
No doubt some aspects of the Transformer architecture are fungible, but as Hochreiter is implicitly proving, you can't just scale up an LSTM and get Transformer-level performance out of it, which is why he has come up with this new xLSTM architecture to try to do better!
The short 2K Transformer context size that Hochreiter is using for xLSTM comparisons seems a bit suspect... Of course the attraction of an RNN is that it has "infinite" context/memory, so it may be expected to outperform a short-context Transformer, while at the same time context scalability is an issue for RNNs, even an LSTM. Has he just cherry-picked the size at which the advantages of an xLSTM outweigh the disadvantages?
Note that despite the table saying GPT-3, he isn't actually testing against GPT-3 (a 175B model), but rather a 400M GPT closer to GPT-1 in size. The only reason he's calling it "GPT-3" is because of the 2K context size.
Could a 1T param xLSTM one-shot a compiler or find a needle in a 1M token haystack? Does an induction-head-like AB => A'B' in-context learning primitive, or something functionally equivalent, emerge out of stacked xLSTM layers?
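To make the induction-head question concrete, here is a rough sketch of the behaviour such a head is usually described as implementing: look back for an earlier occurrence of the current token and copy whatever followed it. This is an exact-match toy (the A'B' fuzzy-match case is not covered), the function name `induction_predict` is made up for illustration, and it says nothing about how a Transformer or xLSTM would actually realize this internally.

```python
def induction_predict(tokens):
    """Copy the token that followed the most recent earlier occurrence
    of the current (last) token; return None if there is no such match."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["Mr", "Dursley", "was", "proud", "Mr"]
prediction = induction_predict(context)
print(prediction)
```

In a Transformer this lookup is cheap because attention can address any earlier position directly; the open question in the comment is whether a fixed-size recurrent state can learn something functionally equivalent.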
At the end of the day it's prediction power that matters, not specific architecture, but we've yet to see any other architecture that functionally competes with a large Transformer. It would be neat to see a significantly different one that did!
And Google's acquisition of DNNresearch to get the ball rolling with conv nets and AI moneyball, followed by the acquisition of DeepMind. Schmidhuber IMO *has* been recognized as one of the 4 horsemen and rightly so, but what has he done lately? Just noticed they now say the 3 godfathers of AI. This is what people hate about academia. It's not academia itself, it's the mean girl politics that emerge from the tenure system. And at this point, tenure should be abolished IMO, having been utterly weaponized to defend the status quo.
Thanks AI for destroying my hobby. :)
This is well put.
2012 really fundamentally changed everything for the AI community, I’d argue because TensorFlow/Keras/PyTorch followed and that made the infrastructure accessible for distributed training.
> The current AI boom has more to do with NVIDIA, and the popularity of computer gaming giving us GPU compute, than who was using neural networks back in the 1990s
I disagree. But more critically, I'd argue it's the legacy of the PDP project that led to what became foundation models today.
The PDP project was very early - relevant in terms of neural net history of course, but hard to see much there relevant to today's large models other than Hinton's reinvention of SGD as an alternative to the layer-wise training that was then the norm.
One interesting thing to note from the PDP handbook is the mention by LeCun and Hinton of what would later be called CNNs, which LeCun claims to have invented. It seems that Hinton deserves just as much credit as LeCun, and in any case these are discussed just as locally connected models using shared weights as an optimization.
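As a rough illustration of "locally connected with shared weights", the sketch below builds a 1D locally connected layer and then reuses one kernel at every position, which is what a convolution amounts to. The function names are invented for this example and this is not code from the PDP handbook; it only shows why sharing weights can be seen as an optimization (far fewer parameters for the same connectivity).

```python
def locally_connected(x, weights_per_position):
    """Each output unit sees a local window of x and has its own weights."""
    k = len(weights_per_position[0])
    return [sum(w * v for w, v in zip(ws, x[i:i + k]))
            for i, ws in enumerate(weights_per_position)]


def shared_weight(x, kernel):
    """Same local windows, but one kernel reused at every position."""
    n_out = len(x) - len(kernel) + 1
    return locally_connected(x, [kernel] * n_out)


x = [1.0, 2.0, 4.0, 7.0, 11.0]
kernel = [-1.0, 1.0]  # responds to the difference between neighbours

out = shared_weight(x, kernel)
n_out = len(out)
params_unshared = n_out * len(kernel)  # one weight set per output position
params_shared = len(kernel)            # a single kernel for all positions
print(out, params_unshared, params_shared)
```

With sharing, the parameter count no longer grows with the input length, and the same feature detector is applied everywhere in the input.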