What does quadratic attention mean?

I've so far found that intuition travels between models of a similar generation remarkably well. The conformance suite trick (find an existing conformance suite with 9,200 tests and tell an agent to build a fresh implementation that passes all of them), which I first found with GPT-5.2, turned out to work just as well with Claude Opus 4.5, for example.

https://arxiv.org/abs/2209.04881

To save anyone else the click, this is the paper "On The Computational Complexity of Self-Attention" from September 2022, with authors from NYU and Microsoft.

It argues that because the self-attention mechanism in transformers has every token "attend to" every other token in a sequence, its cost is quadratic in input length - n^2 for n tokens - which should limit the total context length available to models.
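To make the n^2 concrete, here is a minimal single-head self-attention sketch in numpy. The sizes and weight matrices are purely illustrative, not any particular model's:

    import numpy as np

    def self_attention(x, Wq, Wk, Wv):
        """x: (n, d) token embeddings; returns (n, d) attended outputs."""
        q, k, v = x @ Wq, x @ Wk, x @ Wv            # each (n, d)
        scores = q @ k.T / np.sqrt(k.shape[-1])     # (n, n): every token scored against every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all n keys, per query
        return weights @ v

    n, d = 1_000, 64                                # illustrative sizes
    x = np.random.randn(n, d)
    Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
    out = self_attention(x, Wq, Wk, Wv)             # the scores intermediate is n*n = 1,000,000 entries

The intermediate scores matrix is n x n, so doubling the context length quadruples both the compute and the memory spent on that step.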

This would explain why the top models have been stuck at 1 million tokens since Gemini 1.5 in February 2024. (There has been a 2 million token Gemini, but it's not in wide use, and Meta claimed their Llama 4 Scout could do 10 million, but I don't know of anyone who has seen that actually work.)

My counter-argument here is that Claude Opus 4.5 has a comparatively tiny 200,000 token window, which turns out to work incredibly well for the kinds of coding problems we're throwing at it when accompanied by a cleverly designed harness such as Claude Code. So the limit identified in 2022 has been less "disappointing" than people may have expected.

The quadratic attention problem seems to be largely solved by practical algorithmic improvements (iterations on FlashAttention, etc.)
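Those improvements don't remove the n^2 compute, but they avoid ever materializing the n x n score matrix. Here is a rough numpy sketch of the blockwise "online softmax" recurrence that FlashAttention-style kernels build on; the block size and function name are my own, and this is the arithmetic, not the fused GPU kernel:

    import numpy as np

    def chunked_attention(q, k, v, block=256):
        """q, k, v: (n, d). Same result as full softmax attention, without an (n, n) intermediate."""
        n, d = q.shape
        out = np.zeros_like(q)
        m = np.full(n, -np.inf)                     # running row-wise max of the scores seen so far
        l = np.zeros(n)                             # running softmax denominator
        for start in range(0, n, block):
            kb, vb = k[start:start + block], v[start:start + block]
            s = q @ kb.T / np.sqrt(d)               # (n, block) scores for this chunk of keys
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])          # this chunk's unnormalized weights
            scale = np.exp(m - m_new)               # rescale everything accumulated so far
            l = l * scale + p.sum(axis=-1)
            out = out * scale[:, None] + p @ vb
            m = m_new
        return out / l[:, None]

    q, k, v = (np.random.randn(4_096, 64) for _ in range(3))
    out = chunked_attention(q, k, v)                # never allocates a 4096 x 4096 score matrix

It produces the same output as the naive version while only ever holding an n x block slice of the scores in memory.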

What's practically limiting context size IME is that results seem to get "muddy" and go off track when the context is very large. For a single-topic long session, I imagine you end up with a large number of places in the context that are good matches for a given query, leading to ambiguous results.
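Here's a toy way to see that dilution. The scores are made up to illustrate my assumption about the mechanism, not measured from any model: one chunk of context gets a clearly higher relevance score, surrounded by a growing crowd of "pretty good" matches, and softmax spreads the probability mass thinner and thinner:

    import numpy as np

    rng = np.random.default_rng(0)
    for n_distractors in (10, 1_000, 100_000):
        # one clearly relevant chunk (score 4.0) among many plausible-looking matches (~N(0, 1))
        scores = np.concatenate(([4.0], rng.normal(0.0, 1.0, n_distractors)))
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        print(f"{n_distractors:>7,} distractors -> weight on the relevant chunk: {weights[0]:.4f}")

With ten distractors the relevant chunk still dominates; with a hundred thousand its weight collapses towards zero, even though its raw score never changed.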

I'm also not sure how much work is being put into reinforcement learning at extremely large context lengths, as it's presumably quite expensive to do and hard to test reliably.

Indeed, filling the advertised context more than 1/4 full is a bad idea in general. 50k tokens is a fair bit, but works out to somewhere between 1,000 and 10,000 lines of code.

Perfect for a demo or work on a single self-contained file.

Disastrous for a large code base with logic scattered all throughout it.
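For what it's worth, here's the back-of-the-envelope behind that 1,000 to 10,000 line range; the tokens-per-line densities are assumptions, since real code varies a lot by language and style:

    # staying under ~1/4 of an advertised 200k window leaves roughly 50k tokens
    budget = 200_000 // 4
    for tokens_per_line in (5, 15, 50):             # terse code vs. verbose, boilerplate-heavy code
        print(f"~{tokens_per_line:>2} tokens/line -> ~{budget // tokens_per_line:>6,} lines of code")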