Maybe another analogy (or at least the way I intuitively understood it) is that humans sometimes skip over tokens we know to be fluff/filler. Without a sink, a model has no way to "skip" a token: that token _will_ attend to all previous tokens and be incorporated into the residual stream. It's easy to see that for filler tokens this tends to hurt quality more than help, since you pull in more noise than if you could somehow "skip" the token entirely.
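
A quick way to see the "no skipping" constraint: softmax attention weights always sum to 1, so each token's query has to spend its full attention budget on the available keys. A minimal NumPy sketch with made-up logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits of one query against four earlier keys,
# all of which are filler the query would rather ignore.
logits = np.array([-3.0, -3.0, -3.0, -3.0])
weights = softmax(logits)
print(weights)        # [0.25 0.25 0.25 0.25]
print(weights.sum())  # 1.0 -- some mix of the filler values always gets pulled in
```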

Not quite. If some values are filler and some are not, and the corresponding keys are linearly separable, it's not hard to find a query for which standard attention gives low scores to the filler and high scores to the non-filler. Attention sinks deal with the case where everything is filler, so there's no non-filler to allocate the attention to.
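
A toy illustration of both cases, with made-up logits. The sink is modeled here as one extra key with a fixed logit of 0 (in the spirit of zero-sink / "off-by-one" softmax; the exact parameterization varies by implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Separable case: filler keys score -2 against the query, non-filler +2.
# Standard softmax already starves the filler; no sink needed.
mixed = np.array([-2.0, 2.0, -2.0, 2.0])
print(softmax(mixed))  # filler gets under 1% each

# Degenerate case: every key is filler, so softmax is forced to
# spread the full attention budget across noise.
all_filler = np.array([-2.0, -2.0, -2.0, -2.0])
print(softmax(all_filler))  # uniform 0.25 each

# With a sink (extra fixed logit, assumed 0 here), most of the
# mass escapes to the sink instead of the filler.
with_sink = np.append(all_filler, 0.0)
w = softmax(with_sink)
print(w[:-1].sum())  # ~0.35 on filler, ~0.65 on the sink
```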