> Self-attention is required. The model must contain at least one self-attention layer. This is the defining feature of a transformer — without it, you have an MLP or RNN, not a transformer.

I think it would be interesting to see challenges where two networks are trained and evaluated on exactly the same datasets, with identical architectures except for the presence of self-attention layers in one of them.

So far it seems to me that self-attention really brought new capabilities to a network: essentially, it changes the network's functionality in response to the input. It would be interesting to see whether there are problems (i.e. datasets) that a "traditional" feedforward network fails to solve, but a transformer network of the same size can solve.
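Here's a rough numpy sketch of what I mean (the names and dimensions are my own, just for illustration): in a feedforward layer the weight matrix W is fixed once trained, while in a single self-attention head the mixing matrix A is recomputed from the input itself via softmax(Q·Kᵀ/√d).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # embedding dimension (arbitrary for this sketch)
x = rng.normal(size=(3, d))  # a sequence of 3 token embeddings

# Feedforward: the input is transformed by a fixed, input-independent map.
W = rng.normal(size=(d, d))
ff_out = x @ W

# Self-attention: the mixing matrix A is itself a function of the input x.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax
attn_out = A @ V

print(ff_out.shape, attn_out.shape)  # same shapes, but A varies with x
```

Change x and W stays put, but A changes -- that's the "functionality depends on the input" part.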

My guess would be: yes, there are, and they are the kinds of "variable task" datasets that we see with LLMs, i.e. where part of the input specifies the task itself and part supplies the data for the task.
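A hypothetical toy version of such a "variable task" dataset (my own construction, not a standard benchmark): the first token names the task, the rest is the data. A fixed feedforward map has to hard-wire one behavior; attention can route based on the task token.

```python
import random

def make_example(rng, n=5, vocab=10):
    """One (input, target) pair: input = [task] + data."""
    task = rng.randint(0, 1)            # 0 = copy the data, 1 = reverse it
    data = [rng.randrange(2, vocab) for _ in range(n)]
    target = data if task == 0 else list(reversed(data))
    return [task] + data, target

rng = random.Random(0)
inp, out = make_example(rng)
```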

> So far it seems to me that self-attention really brought new capabilities to a network

Do we have a layman's explanation for what makes self-attention so uniquely powerful? Something more than "it lets you do self-attention".

Computational power. Without self-attention, you have a sloppy implementation of something called a PDA (push-down automaton) -- like an old HP calculator. With it, you have an even sloppier implementation of a Turing machine.

So (modulo a _lot_ of details) it increases the power from that of a "calculator" to that of a "computer".