Did this end up working? It sounds plausible, but it needs empirical validation.
There was skepticism last time this was posted https://news.ycombinator.com/item?id=37740932
An implementation for gpt-oss in llama.cpp this week showed 2-3x improvements: https://github.com/ggml-org/llama.cpp/pull/15157 https://www.reddit.com/r/LocalLLaMA/comments/1mkowrw/llamacp...
Yeah, attention sinks were applied to gpt-oss
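For context, the core idea in the original attention-sinks paper is a KV-cache policy: always keep the first few tokens (the "sinks") plus a sliding window of recent tokens, and evict everything in between. Here's a minimal Python sketch of that policy under those assumptions; the names (SinkKVCache, n_sink, window) are mine, not from the llama.cpp PR, and gpt-oss's variant reportedly uses learned sink values rather than literal initial tokens.

    from collections import deque

    class SinkKVCache:
        """Sketch of a StreamingLLM-style cache: sink tokens + sliding window."""

        def __init__(self, n_sink: int = 4, window: int = 1024):
            self.n_sink = n_sink                 # initial tokens kept forever (the "sinks")
            self.window = window                 # how many recent tokens to keep
            self.sinks = []                      # KV entries for the first n_sink tokens
            self.recent = deque(maxlen=window)   # rolling window of recent KV entries

        def append(self, kv_entry):
            """Add one token's (key, value) pair, evicting old middle tokens."""
            if len(self.sinks) < self.n_sink:
                self.sinks.append(kv_entry)
            else:
                self.recent.append(kv_entry)     # deque drops the oldest automatically

        def entries(self):
            """KV entries attention actually sees: sinks followed by the window."""
            return self.sinks + list(self.recent)

The point of the sinks is that they absorb the attention mass the softmax would otherwise dump onto arbitrary early tokens, which is what lets the sliding window run indefinitely without perplexity blowing up.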