This is really interesting. I've always wondered how it works.

A couple of years ago I did some experiments using a feed-forward network (MLP) as a surrogate for attention, to avoid the quadratic blow-up.

It worked but had problems at the time, and my mind wasn't really in it.

This has dug it back out again, now with the benefit of time and some additional insights.

So now I'm thinking you could use a lot of the insights from the work here, but also shoot for a surrogate with fully linear scaling.
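To make "surrogate" concrete: I'm picturing something like a per-token MLP that only sees the current hidden state plus a cheap running summary of the prefix, so the per-token cost is constant. A rough sketch (the architecture and the names are purely illustrative, not something I've validated):

    import torch
    import torch.nn as nn

    class AttentionSurrogate(nn.Module):
        """Illustrative linear-cost stand-in for a self-attention block.

        Each position sees only its own hidden state plus a causal prefix
        mean, so the cost is O(n) in sequence length rather than O(n^2).
        """
        def __init__(self, d_model: int, d_hidden: int = 1024):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(2 * d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model)
            counts = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
            prefix_mean = x.cumsum(dim=1) / counts  # causal summary of the prefix
            return self.mlp(torch.cat([x, prefix_mean], dim=-1))

Anything that summarizes the prefix in O(1) per token would do here; the prefix mean is just the simplest choice.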

The trick is to use the surrogate as a discriminator under an RL regime during training.

Instead of applying better/faster math and optimizations alone, have the model learn during training to work with a fundamentally better inference approach.

If you do that, you can turn the approximation error inherent in FFN-surrogate inference into a recovery signal encoded into the model itself.
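I haven't worked the RL/discriminator part out in code yet, but a simplified supervised version of the training loop makes the shape clear: every so often, swap the real attention output for the surrogate's, keep the normal LM loss, and add a term pulling the surrogate toward what attention would have produced. Rough sketch only; every interface name below is invented:

    import random
    import torch.nn.functional as F

    def train_step(model, surrogates, batch, optimizer, p_swap=0.25, alpha=0.1):
        """One hypothetical training step.

        Assumes `model` accepts a `replacements` dict mapping layer index ->
        surrogate module to call in place of attention for that layer, and
        returns (logits, attn_outs, surr_outs), where the two dicts hold the
        real attention output and the surrogate output for each replaced
        layer. None of this exists in stock GPT-2 code; it stands in for
        whatever wiring you actually use.
        """
        # Randomly route some layers through the cheap surrogate this step, so
        # downstream layers see, and learn to correct, its approximation error.
        replacements = {i: s for i, s in enumerate(surrogates) if random.random() < p_swap}
        logits, attn_outs, surr_outs = model(batch["input_ids"], replacements=replacements)

        # Ordinary next-token loss, computed through the mixed attention/surrogate path.
        lm_loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch["input_ids"][:, 1:].reshape(-1),
        )

        # Matching term pulling each surrogate toward the attention output it replaced;
        # a supervised stand-in for the discriminator/RL signal described above.
        match_loss = sum(
            F.mse_loss(surr_outs[i], attn_outs[i].detach()) for i in replacements
        ) / max(len(replacements), 1)

        loss = lm_loss + alpha * match_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The point is that the LM loss flows through the surrogate path, so the rest of the network is trained to absorb the surrogate's error rather than meeting it cold at inference time.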

I haven't tried it, but don't see a reason it shouldn't work. Will give it a go on a GPT-2 model ASAP.

Thanks again for the awesome article.