Hacker News

Why is it surprising that, at some point, more information will lead to worse performance?

It seems obvious. Moreover, in a simple model, whatever tokens you add have to carry MORE information than the average token already in the window, otherwise they just dilute it.
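To make the simple model concrete, here is a rough sketch. It assumes (my assumption, not an established result) that performance tracks the average information per token in the window. The per-token values and the helper names `mean_info` and `mean_after_append` are made up for illustration.

```python
def mean_info(window):
    """Average information per token (arbitrary, hypothetical units)."""
    return sum(window) / len(window)


def mean_after_append(window, new_tokens):
    """Average information per token after appending new_tokens to the window."""
    return mean_info(list(window) + list(new_tokens))


# Hypothetical per-token information values; the mean of this window is 4.0.
window = [2.0, 4.0, 6.0]

# Appending a token below the current mean pulls the average down.
below = mean_after_append(window, [3.0])

# Appending a token above the current mean pulls the average up.
above = mean_after_append(window, [8.0])

print(mean_info(window), below, above)
```

Under that assumption, an added token helps only if it beats the current average, which is all the simple model claims.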

In a non-trivial model (and this is the model I would choose), since you are appending them to the end of the window, they likely need to carry MUCH more information.

Proof, as always, is left as an exercise for the reader.