The point is that there's not always a right token to attend to. If the information you're looking for isn't there, no clever attention scheme will find it. The best you can hope for in that situation is that the value returned in the "not found" case is distinguishable from the values returned in the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
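To make that concrete, here's a hand-wired numpy toy (the geometry and the specific key/value choices are contrived for illustration; a real model learns this through training rather than having it built in): a sink with a zero key and a fixed value sits alongside two content keys. A query that matches real content gets that content; a query about something absent scores below the sink on every real key, so most of the mass lands on the sink and the output is recognisably the "not found" value rather than a blur of the real values.

```
import numpy as np

def attend(query, keys, values):
    logits = keys @ query / np.sqrt(len(query))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights, weights @ values

# dims: [topic A, topic B, topic C, bias]; the -4 bias component is what makes
# an unmatched query score *below* the sink's zero logit (hand-wired here,
# learned in a real model).
key_A, val_A = np.array([8., 0, 0, -4]), np.array([1., 0, 0, 0])
key_B, val_B = np.array([0., 8, 0, -4]), np.array([0., 1, 0, 0])
sink_key, sink_val = np.zeros(4), np.array([0., 0, 0, -1.])  # fixed "not found" value

keys = np.stack([sink_key, key_A, key_B])
vals = np.stack([sink_val, val_A, val_B])

# Query about topic A (present in context): nearly all mass lands on key_A.
w, out = attend(np.array([8., 0, 0, 1]), keys, vals)
print(w.round(2), out.round(2))   # weights ~ [0, 1, 0], output ~ val_A

# Query about topic C (absent): both real keys score below the sink, so the
# mass collects on the sink and the output is dominated by the distinguishable
# sink value rather than a noisy blend of val_A and val_B.
w, out = attend(np.array([0., 0, 8, 1]), keys, vals)
print(w.round(2), out.round(2))
```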

Maybe another analogy (or at least the way I intuitively understood it) is that humans sometimes skip over tokens we know to be fluff/filler. Without a sink, models have no way to "skip" a token: that token _will_ attend to all previous tokens, and the result _will_ be incorporated into its residual stream. It's easy to see that for filler tokens this tends to hurt quality more than help, since you're more likely to pull in noise than if you could somehow "skip" that token entirely.
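Roughly what I mean, as a minimal numpy sketch (nothing model-specific, just the softmax arithmetic): the attention weights are forced to sum to 1, so a token whose query matches none of the previous keys still writes some weighted mix of their values into its residual stream; there is no weight setting that means "take nothing".

```
import numpy as np

rng = np.random.default_rng(0)
d, n_prev = 8, 6
keys = rng.normal(size=(n_prev, d))      # keys of the previous tokens
values = rng.normal(size=(n_prev, d))    # their values

query = rng.normal(size=d)               # the filler token's query; no good match
logits = keys @ query / np.sqrt(d)
weights = np.exp(logits - logits.max())
weights /= weights.sum()

print(weights.sum())                     # 1.0: the mass has to go somewhere
print(np.linalg.norm(weights @ values))  # a non-zero update enters the residual stream
```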

Not quite. If some values are filler and some are not, and the corresponding keys are linearly separable, it's not difficult to find a query where standard attention gives low scores to the filler and high scores to the non-filler. Attention sinks deal with the case where everything is filler, so there's no non-filler to allocate attention to.
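A contrived numpy example of the separable case (the ±4 offset along the separating direction is made up purely for illustration): filler keys sit on one side of a hyperplane, content keys on the other, and a query along the normal gives essentially all of its attention to content. The sink only earns its keep in the degenerate case where every key is on the filler side.

```
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 8
normal = rng.normal(size=d)
normal /= np.linalg.norm(normal)                     # separating direction

content_keys = rng.normal(size=(3, d)) + 4 * normal  # one side of the hyperplane
filler_keys  = rng.normal(size=(5, d)) - 4 * normal  # the other side
keys = np.vstack([content_keys, filler_keys])

query = 4 * normal                                   # query along the normal
weights = softmax(keys @ query / np.sqrt(d))

print(weights.round(3))
# The first three (content) weights carry essentially all of the mass; the five
# filler weights are near zero. Standard attention already handles the mixed case.
```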

Good point. Does that mean they help mitigate hallucinations?

In a sense? As the article notes, models trained with standard attention develop attention sinks naturally, and removing them makes the model deteriorate completely, so the hallucinations you're thinking of were most likely produced by a model that had already mitigated them in this way.