The vectors don't need to be orthogonal, because neural networks have non-linearities. The softmax in attention lets you pack effectively as many vectors as you want into one dimension and still pick them out unambiguously.
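A minimal numpy sketch of that packing idea (my own illustration, not from the original comment; it scores keys by scaled negative squared distance rather than a raw dot product, purely so the 1D case stays unambiguous, and the sharpness constant beta is an arbitrary choice):

```python
import numpy as np

keys = np.linspace(0.0, 1.0, 50)      # 50 scalar "keys" packed into a single dimension
values = np.arange(50)                # payload associated with each key
query = keys[17]                      # we want to retrieve value 17
beta = 1e5                            # sharpness of the softmax (assumed, for illustration)

scores = -beta * (query - keys) ** 2  # non-linear similarity, still entirely in 1D
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax attention weights

retrieved = weights @ values
print(retrieved)                      # ~17.0: essentially all weight lands on the matching key
```

The exponential in the softmax is doing the work here: even though the 50 keys are crammed into one dimension and are nowhere near orthogonal, sharpening the scores lets one key dominate the mixture.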