Isn't transformer attention quadratic in the context size? To reach a 1M token context, I think these models have to be employing a lot of shortcuts.
I'm not an expert, but maybe this explains context rot.
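For anyone curious where the quadratic cost comes from, here is a minimal sketch in plain NumPy. All names and sizes are mine and purely illustrative, not any lab's actual implementation. Vanilla attention materializes an n x n score matrix over the n tokens of context:

    import numpy as np

    def vanilla_attention(q, k, v):
        # Toy single-head attention. q, k, v are (n, d) arrays for a
        # context of n tokens. The score matrix is (n, n), so compute
        # and memory for it grow quadratically with context length n.
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)      # (n, n): the O(n^2) term
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v                 # (n, d)

    # At n = 1_000_000 tokens, the (n, n) score matrix alone would
    # hold 10^12 entries per head per layer, hence the interest in
    # shortcuts.
    n, d = 8, 4  # tiny demo sizes
    rng = np.random.default_rng(0)
    out = vanilla_attention(rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
    print(out.shape)  # (8, 4)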
Nope, there are no tricks, unless there have been major architectural shifts I missed. The rot doesn't come from inference tricks to bring down the quadratic cost of attention. Task performance problems are generally a training problem: the longer the context, the fewer real examples exist at that length to train on. So how do you train the model to behave well at long context? That's where the tricks are. If I'm not mistaken, most of it relies on synthetically generated data, which would explain the rot.
A quick Google search reveals terms such as "sparse attention" that are used to avoid quadratic runtime.
I don't know whether Anthropic has revealed such details, since AI research is getting more and more secretive, but the architectural tricks definitely exist.
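To make "sparse attention" concrete: it covers a family of patterns that compute only a subset of the n x n score matrix. Below is a hedged sketch of one common pattern, a causal sliding window; this is a generic textbook technique, not a claim about Anthropic's architecture, which isn't public. Each query attends only to its w most recent keys, cutting the cost from O(n^2) to O(n*w):

    import numpy as np

    def sliding_window_attention(q, k, v, w=4):
        # Toy sliding-window ("sparse") attention, purely illustrative.
        # Each query attends only to the w most recent keys, so work
        # per token is O(w) instead of O(n): O(n*w) overall.
        n, d = q.shape
        out = np.empty_like(v)
        for i in range(n):
            lo = max(0, i - w + 1)  # causal window of at most w keys
            scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i] = weights @ v[lo:i + 1]
        return out

    rng = np.random.default_rng(0)
    n, d = 8, 4  # tiny demo sizes
    out = sliding_window_attention(rng.normal(size=(n, d)),
                                   rng.normal(size=(n, d)),
                                   rng.normal(size=(n, d)))
    print(out.shape)  # (8, 4)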