The attention sink as used in gpt-oss is similar to your link. But rather than adding one to the denominator, they add a trainable 'logit' (a different one per head), so the denominator gains exp(logit) instead of a fixed 1.
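Roughly, in PyTorch it looks something like this (a minimal sketch, not gpt-oss's actual code; the name `sink_logits` is illustrative, and in practice it would be an nn.Parameter of shape (num_heads,)):

```python
import torch
import torch.nn.functional as F

def attention_with_sinks(q, k, v, sink_logits):
    # q, k, v: (batch, heads, seq, dim); sink_logits: (heads,) trainable
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (b, h, s, s)
    # Append each head's sink logit as an extra "virtual" key column,
    # so softmax normalizes over the real keys plus the sink.
    sink = sink_logits.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    # Drop the sink's probability mass; the remaining weights sum to <= 1,
    # which is equivalent to adding exp(sink_logit) to the denominator.
    return probs[..., :-1] @ v
```

The per-head scalar lets each head learn how much attention mass it's allowed to "throw away" when no key is relevant, rather than hardcoding that amount to 1 as in the +1 variant.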