worth separating: LSTM (Hochreiter & Schmidhuber 1997) is ironclad and widely cited. the transformer attention priority claims are far shakier. conflating them is how Schmidhuber undermines himself

Yes, and it's notable that Alex Graves, one of Schmidhuber's students and later at DeepMind, doesn't even mention Schmidhuber in his historical overview of attention mechanisms, "Attention and Memory in Deep Learning".

https://www.youtube.com/watch?v=AIiwuClvH6k

When it comes to attention, details matter, since the idea itself is obvious: weighting inputs. Implicit attention is present in every neural network; this is what weights are.

The specific form of attention used by the Transformer is key-based associative (content-based) attention, the family usually traced to "Bahdanau attention", introduced in Bahdanau, Cho and Bengio's paper "Neural Machine Translation by Jointly Learning to Align and Translate". (The Transformer scores matches with scaled dot products rather than Bahdanau's small additive network, but the soft lookup over the context is the same idea.) It's perhaps worth noting that the word "attention" is barely mentioned in this paper, other than a remark that this weighted input mechanism can be seen as a form of attention (presumably included because attention was at that time a recurring theme in various types of neural network).

Bahdanau-style attention - not just the general concept of attention - seems to be a critical piece of the Transformer architecture, since this is what allows the Transformer to find things in context and is behind the "induction head" mechanism that appears central to how Transformers operate.
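To make "find things in context" concrete, here is a rough toy sketch of key-based soft attention doing an induction-style lookup. It is not taken from any paper or library: the names `attend`, `one_hot`, `predict_next` and the `sharpness` parameter are made up for illustration, the embeddings are plain one-hot vectors, and the scoring is the Transformer's scaled dot product rather than Bahdanau's additive network. Real induction heads are learned, multi-layer circuits; here the "key = previous token" step is simply hard-coded.

```python
import math

VOCAB = ["A", "B", "C", "D"]  # toy vocabulary (assumption for this sketch)

def one_hot(token, scale=1.0):
    """Hypothetical embedding: a scaled one-hot vector over VOCAB."""
    return [scale if t == token else 0.0 for t in VOCAB]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Soft lookup: weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

def predict_next(context, current, sharpness=8.0):
    """Induction-style guess: find where `current` occurred earlier in
    `context` and return the token that followed it."""
    # Key at position i describes the token *before* position i;
    # value at position i is the token *at* position i.
    keys = [one_hot(context[i - 1]) for i in range(1, len(context))]
    values = [one_hot(context[i]) for i in range(1, len(context))]
    query = one_hot(current, scale=sharpness)  # larger scale, peakier weights
    output, weights = attend(query, keys, values)
    best = max(range(len(VOCAB)), key=lambda i: output[i])
    return VOCAB[best], weights

predicted, weights = predict_next(["A", "B", "C", "D"], "B")
print(predicted)  # "C": the token that followed the earlier "B"
```

The point of the sketch is that nothing in it is positional: the match is made on content (the key), and what comes back is whatever is stored alongside it (the value), which is what distinguishes this from ordinary fixed weights.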