> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.

This sounds like it is working for the wrong reasons. Surely the right behavior is for the right tokens to receive attention rather than the first handful. Jamming everything onto the first few positions is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior, or something similar, to prevent blurring.

The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the values returned in the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
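
Here's a minimal numpy sketch of that distinction; the scores are made up, and the fixed sink score just stands in for whatever a trained sink key would actually produce:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy 1-D "values" for 4 context tokens, purely for illustration.
values = np.array([3.0, -1.0, 2.0, 0.5])

# "Found": the query matches token 2, so its score dominates.
found = np.array([0.1, 0.2, 8.0, 0.0])
# "Not found": nothing matches, so all scores are small and similar.
not_found = np.array([0.1, 0.2, 0.3, 0.0])

# Plain softmax must hand out all of its weight either way, so the
# "not found" output is just an arbitrary blend of unrelated values.
print(softmax(found) @ values)      # ~2.0, the matched value
print(softmax(not_found) @ values)  # ~1.1, a meaningless average

# Sink variant: append a token with a fixed score and a zero value
# (the fixed score is a stand-in for a learned sink key).
SINK_SCORE = 4.0
values_s = np.append(values, 0.0)
print(softmax(np.append(found, SINK_SCORE)) @ values_s)      # still ~2.0
print(softmax(np.append(not_found, SINK_SCORE)) @ values_s)  # ~0.1, recognizably "not found"
```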

Maybe another analogy (or at least the way I intuitively understood it) is that humans sometimes skip over tokens we know to be fluff/filler. Without a sink, a model has no way to "skip" a token: every token _will_ attend to all previous tokens, and the result gets incorporated into the residual stream. It's easy to see that for filler tokens this tends to hurt quality more than improve it, since you're more likely to pull in noise than if you could somehow "skip" that token entirely.

Not quite. If some values are filler and some are not, and the corresponding keys are linearly separable, it's not difficult to find a query where standard attention gives low scores to the filler and high scores to the non-filler. Attention sinks deal with the case where everything is filler, so there is no non-filler token to allocate attention to.
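
Rough sketch of both cases, with synthetic keys built around an assumed separating direction (nothing here is from the article; it's just to make the argument concrete):

```python
import numpy as np

rng = np.random.default_rng(0)

def attn_weights(q, K):
    s = K @ q
    e = np.exp(s - s.max())
    return e / e.sum()

d = 32
# Assumed separating direction: non-filler keys have a positive component
# along w, filler keys a negative one (that's the "linearly separable" part).
w = rng.normal(size=d)
w /= np.linalg.norm(w)

def make_keys(n, sign):
    return sign * 2.0 * w + 0.1 * rng.normal(size=(n, d))

q = 4.0 * w  # a query pointed along the separating direction

# Mixed case: 3 non-filler keys then 5 filler keys. Attention concentrates
# on the first three rows; no sink needed.
K_mixed = np.vstack([make_keys(3, +1), make_keys(5, -1)])
print(attn_weights(q, K_mixed).round(3))

# All-filler case: softmax still has to spend all of its mass somewhere,
# so it gets smeared across junk tokens.
K_filler = make_keys(8, -1)
print(attn_weights(q, K_filler).round(3))

# Adding a sink (here a zero key, so its score is always 0) soaks up the
# mass instead: nearly all the weight lands on index 0.
K_sink = np.vstack([np.zeros(d), K_filler])
print(attn_weights(q, K_sink).round(3))
```

The zero key is the crudest possible sink; in a trained model the same role is played by a token (typically the first one) whose key and value the model has learned to use this way.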

Good point. Does that mean they help mitigate hallucinations?

In a sense? As the article notes, models trained with standard attention develop attention sinks naturally, and removing them makes the model deteriorate completely, so the hallucinations you're thinking of were most likely produced by a model that had already mitigated them in this way.