The singular defects (high-norm tokens) [1] may be related to attention sinks. It is interesting that all high-norm tokens share the same direction. Perhaps the underlying theory is not very complex, and the issue could be fixed cleverly during training.
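
One way to check the shared-direction observation on a given layer is to pick out the tokens with unusually large norms and measure their pairwise cosine similarity. The following is only a minimal sketch on synthetic data, not the procedure from [1]: the function name `high_norm_direction_similarity`, the `norm_ratio` threshold (a multiple of the median norm), and the toy hidden states are all assumptions made for illustration.

```python
import math
import random
from statistics import median


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def high_norm_direction_similarity(hidden, norm_ratio=5.0):
    """Return (indices of high-norm tokens, mean pairwise cosine similarity).

    hidden: list of token hidden states (each a list of floats) from one layer.
    norm_ratio: hypothetical threshold; a token counts as "high-norm" when its
    norm exceeds norm_ratio times the median token norm.
    """
    norms = [norm(v) for v in hidden]
    threshold = norm_ratio * median(norms)
    idx = [i for i, n in enumerate(norms) if n > threshold]
    if len(idx) < 2:
        return idx, None  # not enough high-norm tokens to compare directions
    # Normalize the selected tokens so dot products become cosine similarities.
    units = [[x / norms[i] for x in hidden[i]] for i in idx]
    sims = [
        sum(a * b for a, b in zip(units[p], units[q]))
        for p in range(len(units))
        for q in range(p + 1, len(units))
    ]
    return idx, sum(sims) / len(sims)


# Synthetic stand-in for real activations: Gaussian tokens, three of which
# receive a large component along one shared (made-up) direction.
random.seed(0)
dim = 32
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
direction = [random.gauss(0, 1) for _ in range(dim)]
d_norm = norm(direction)
direction = [x / d_norm for x in direction]
for i in (3, 17, 40):
    hidden[i] = [h + 200.0 * d for h, d in zip(hidden[i], direction)]

idx, mean_cos = high_norm_direction_similarity(hidden)
print(idx, mean_cos)
```

On real activations, a mean cosine similarity close to 1 among the selected tokens would be consistent with the shared-direction claim, while a value near 0 would not. The threshold would likely need tuning per model and layer.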

[1] https://openreview.net/pdf?id=4yBnUokU2v