The singular defects (or high-norm tokens) [1] may be related to attention sinks. It is interesting that the direction of all high-norm tokens share the same direction. Maybe the theory behind is not very complex and the issue can be fixed cleverly during training.