The attention sink as used in gpt-oss is similar to your link. But rather than adding one to the denominator, they add a trainable 'logit' (a different one per head), so the denominator gains exp(logit) instead of a fixed 1.
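Roughly, in PyTorch it looks something like this (a minimal sketch, not gpt-oss's actual code; the name `sink_logits` is illustrative, and in practice it would be an nn.Parameter of shape (num_heads,)):

```python
import torch
import torch.nn.functional as F

def attention_with_sinks(q, k, v, sink_logits):
    # q, k, v: (batch, heads, seq, dim); sink_logits: (heads,) trainable
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (b, h, s, s)
    # Append each head's sink logit as an extra "virtual" key column,
    # so softmax normalizes over the real keys plus the sink.
    sink = sink_logits.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # Drop the sink's probability mass; the remaining weights sum to <= 1,
    # which is equivalent to adding exp(sink_logit) to the denominator.
    return probs[..., :-1] @ v
```

The per-head scalar lets each head learn how much attention mass it's allowed to "throw away" when no key is relevant, rather than hardcoding that amount to 1 as in the +1 variant.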