I agree. I also think it's about the hardware and, obviously, recognizing AD as the fundamental primitive.
Particular architectures don't matter so much yet. It's quite possible that S3-Mamba or xLSTM could be used in lieu of transformers and we would still have LLMs.
No doubt some aspects of the Transformer architecture are fungible, but, as Hochreiter is implicitly proving, you can't just scale up an LSTM and get Transformer-level performance out of it. That is why he has come up with this new xLSTM architecture to try to do better!
The short 2K Transformer context size that Hochreiter is using for xLSTM comparisons seems a bit suspect... Of course, the attraction of an RNN is that it has "infinite" context/memory, so it may be expected to outperform a short-context Transformer, while at the same time context scalability is an issue for RNNs, even an LSTM. Has he just cherry-picked the size at which the advantages of an xLSTM outweigh the disadvantages?
Note that despite the table saying GPT-3, he isn't actually testing against GPT-3 (a 175B model), but rather a 400M GPT closer to GPT-1 in size. The only reason he's calling it "GPT-3" is the 2K context size.
Could a 1T param xLSTM one-shot a compiler or find a needle in a 1M token haystack? Does an induction-head-like AB => A'B' in-context learning primitive, or something functionally equivalent, emerge out of stacked xLSTM layers?
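To make the AB => A'B' question concrete, here is a rough sketch of the behavior an induction head is usually described as implementing: look back for an earlier occurrence of the current token and predict whatever followed it. The function name `induction_predict` is made up for illustration, and it uses exact token matching, which is a simplification of the fuzzy A ~ A' matching mentioned above. It is a reference rule you could score any sequence model against, not a claim about how xLSTM or a Transformer computes it internally.

```python
def induction_predict(tokens):
    """Hypothetical reference rule: copy the token that followed the most
    recent earlier occurrence of the last token (exact match only)."""
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    # The last token has not been seen before, so the rule has nothing to copy.
    return None


if __name__ == "__main__":
    # "A B C D A" -> the earlier "A" was followed by "B"
    print(induction_predict(["A", "B", "C", "D", "A"]))
```

A model that has learned this primitive should agree with the rule on random sequences with repeated tokens, including ones much longer than anything it saw in training, which is where a fixed-size recurrent state would presumably be under the most pressure.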
At the end of the day it's prediction power that matters, not specific architecture, but we've yet to see any other architecture that functionally competes with a large Transformer. It would be neat to see a significantly different one that did!