This all reminds me of the bias term of a perceptron.
And with transformers we started without one, and the network repurposed one of the inputs for that role. That annoyed some people because dropping that particular input now affects the whole thing unreasonably, and annoyed others because the weight on that input ended up unreasonably high, since it sort of balanced out all the others.
So initially people (from hanlab) tried to pin this input so it doesn't get dropped. Then they (from OpenAI this time) decided to skip the input entirely by providing a learnable bias inside the network (doing what a classical perceptron does), and now this guy proposes a further optimization: just set the bias to 1 everywhere. That might work perfectly fine, since we don't really care about absolute values; ultimately we just pick the largest one and don't care what it was. So during training all the other weights simply get rescaled so the bias can be 1. It's a little like doing physics calculations with the speed of light set to 1.
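Roughly what that "fixed at 1" variant looks like, as a sketch (my own toy code, not anyone's actual implementation): a phantom logit pinned at 0 contributes exp(0) = 1 to the softmax denominator, so the attention weights over real positions no longer have to sum to 1.

    import torch

    def softmax1(scores: torch.Tensor) -> torch.Tensor:
        # phantom logit fixed at 0 adds exp(0) = 1 to the denominator,
        # so the weights over real positions can sum to less than 1
        m = scores.max(dim=-1, keepdim=True).values.clamp(min=0.0)
        e = torch.exp(scores - m)
        return e / (torch.exp(-m) + e.sum(dim=-1, keepdim=True))

    scores = torch.tensor([-4.0, -5.0, -3.5])  # every position looks irrelevant
    print(torch.softmax(scores, dim=-1))       # standard: still forced to sum to 1
    print(softmax1(scores))                    # off-by-one: sums to ~0.05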
If you have a simple feed-forward network of perceptrons where in the end you just pick the largest output and don't care about absolute values, then maybe you'd also be fine with setting all the perceptron bias terms to 1 and excluding them from learning.
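Purely as an illustration of that thought experiment (not a claim that it trains well), a minimal PyTorch sketch with every bias pinned to 1 and left out of the optimizer; all the names here are made up:

    import torch
    import torch.nn as nn

    class FixedBiasNet(nn.Module):
        def __init__(self, d_in=16, d_hidden=32, n_classes=4):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, n_classes)
            for layer in (self.fc1, self.fc2):
                nn.init.constant_(layer.bias, 1.0)   # set every bias to 1 ...
                layer.bias.requires_grad_(False)     # ... and never train it

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    net = FixedBiasNet()
    # only the weights (not the frozen biases) go to the optimizer
    opt = torch.optim.SGD((p for p in net.parameters() if p.requires_grad), lr=1e-2)
    pred = net(torch.randn(8, 16)).argmax(dim=-1)    # only the argmax is ever used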
Is bias learnable in biological neurons? Doesn't the action potential threshold (or whatever it's called) rely on some chemistry, and isn't it the same for all neurons?
The attention sink as used in gpt-oss is similar to your link. But rather than adding one to the denominator, they add a trainable 'logit' (a different logit for each head).
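For concreteness, here's roughly what a per-head trainable sink logit looks like (a sketch; the class and names are mine, not the actual gpt-oss or llama.cpp code):

    import torch
    import torch.nn as nn

    class SinkAttention(nn.Module):
        def __init__(self, n_heads: int):
            super().__init__()
            # one trainable sink logit per head, instead of a fixed +1
            self.sink = nn.Parameter(torch.zeros(n_heads))

        def forward(self, scores: torch.Tensor) -> torch.Tensor:
            # scores: (batch, n_heads, q_len, k_len) attention logits
            b, h, q, k = scores.shape
            sink = self.sink.view(1, h, 1, 1).expand(b, h, q, 1)
            # softmax over the keys plus the sink column ...
            probs = torch.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
            # ... then drop the sink column: weights on real keys can now
            # sum to less than 1, with the missing mass absorbed by the sink
            return probs[..., :k]

    attn = SinkAttention(n_heads=2)
    weights = attn(torch.randn(1, 2, 3, 5))
    print(weights.sum(dim=-1))  # <= 1 per query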
It would be super funny if it was sufficient.
Did this end up working? It sounds plausible but it needs some empirical validation.
There was skepticism last time this was posted https://news.ycombinator.com/item?id=37740932
Implementation for gpt-oss this week showed 2-3x improvements https://github.com/ggml-org/llama.cpp/pull/15157 https://www.reddit.com/r/LocalLLaMA/comments/1mkowrw/llamacp...
Yeah, attention sinks were applied to gpt-oss