Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.

I think you meant commutative.

Attention also has some specific properties.

And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t tome steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347