I saw a fairly large improvement in my toy transformer model when I added a "global" token, akin to the CLS token in ViT.
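
For anyone who wants to picture it, here is a minimal pure-Python sketch of the idea, not my actual model code. It assumes a single attention head with identity Q/K/V projections, and the names `self_attention` and `with_global_token` are made up for illustration. The global token is prepended to the sequence, attends to every token, and its output row acts as a sequence-level summary.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def with_global_token(tokens, global_token):
    """Prepend the global token, attend, then split the result back apart."""
    mixed = self_attention([global_token] + tokens)
    return mixed[0], mixed[1:]  # (global summary, updated sequence tokens)


random.seed(0)
d_model = 4
sequence = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]
# Assumption: zeros stand in for what would be a learned parameter in a real model.
global_token = [0.0] * d_model

summary, updated = with_global_token(sequence, global_token)
```

With an all-zero global token every attention score for that row is zero, so the summary is just a uniform average over all positions. A trained global token would instead learn which positions to weight.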

Another approach I've seen is the Differential Transformer ("Diff Transformer") from Microsoft Research (https://github.com/microsoft/unilm/tree/master/Diff-Transfor...).
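
As I understand it, the core idea there is to compute two softmax attention maps and subtract one from the other, scaled by a factor lambda, so that attention both maps assign to irrelevant tokens cancels out. Below is a rough pure-Python sketch of just that subtraction. It is not the repo's code: the function names are hypothetical, lambda is passed in as a plain number rather than learned, and projections, multiple heads, and normalization are all left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-wise softmax of scaled dot-product scores."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam):
    """Apply (map1 - lam * map2) to the values; lam is fixed here for illustration."""
    map1 = attention_map(q1, k1)
    map2 = attention_map(q2, k2)
    d_v = len(values[0])
    out = []
    for row1, row2 in zip(map1, map2):
        diff = [w1 - lam * w2 for w1, w2 in zip(row1, row2)]
        out.append([sum(w * v[j] for w, v in zip(diff, values)) for j in range(d_v)])
    return out


q = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# lam = 0 reduces to ordinary softmax attention.
plain = differential_attention(q, q, q, q, values, 0.0)
# Identical maps with lam = 1 cancel completely.
cancelled = differential_attention(q, q, q, q, values, 1.0)
```

The two extremes are only sanity checks: in practice the two maps come from different projections, so the subtraction removes what they share and keeps what differs.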