There are some not-so-niche communities, like the FlashAttention and LinearFlashAttention repos. New code and optimizations get committed on a weekly basis; they find a couple of percent here and there all the time. How useful their kernels actually are in terms of producing good results remains to be seen, but their implementations are often much better (in FLOPS) than what was proposed in the original papers.
It's just like game optimization: cache-friendliness and memory-hierarchy awareness are huge in attention mechanisms. But programming the backward pass at these lower levels of the stack is definitely not fun; the tensor calculus breaks my brain.
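To make the cache-friendliness point concrete, here's a toy NumPy sketch of the online-softmax tiling idea behind FlashAttention. It's illustrative only, nothing like a real fused GPU kernel: the point is just that by streaming K/V in blocks and keeping a running (max, normalizer, output) per query row, you never have to materialize the full n-by-n score matrix in slow memory.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Standard attention: materializes the full (n, n) score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    # FlashAttention-style tiling (forward pass only): process K/V in
    # blocks, carrying a running max, softmax normalizer, and partial
    # output per query row, rescaling as the running max changes.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per row
    row_sum = np.zeros(n)           # running softmax normalizer per row
    for j in range(0, n, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                    # (n, block) partial scores
        new_max = np.maximum(row_max, S.max(axis=-1))
        correction = np.exp(row_max - new_max)  # rescale old accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ Vj
        row_max = new_max
    return out / row_sum[:, None]
```

Both functions compute the same thing; the tiled version just trades one big matrix for a streaming pass, which is exactly the kind of memory-hierarchy trick the comment is gesturing at. (The backward pass, which recomputes the tiles instead of storing them, is where it gets genuinely painful.)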