This is well established at this point; it’s called “context rot”: https://research.trychroma.com/context-rot

Yeah, though this paper doesn't test on any of the standard LLM benchmarks such as GPQA Diamond, SimpleQA, AIME 25, or LiveCodeBench v5, so it remains hard to tell how much capability is actually lost when the context is filled with irrelevant information.
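The experiment being suggested is simple enough to sketch: take an existing benchmark question, pad the context with varying amounts of irrelevant filler, and compare accuracy across padding levels. A minimal sketch of the prompt-construction half (the `distractors` pool and token budget are assumptions, and the model call itself is left out):

```python
import random

def pad_with_distractors(question: str, distractors: list[str],
                         filler_words: int, seed: int = 0) -> str:
    """Build a prompt with roughly `filler_words` words of irrelevant
    text before the benchmark question, so accuracy can be compared
    across padding levels (0, 1k, 10k, ... words)."""
    rng = random.Random(seed)  # fixed seed so each padding level is reproducible
    filler: list[str] = []
    count = 0
    while count < filler_words:
        snippet = rng.choice(distractors)
        filler.append(snippet)
        count += len(snippet.split())
    # Question goes last, mirroring the long-context "needle" setups
    return "\n".join(filler) + "\n\n" + question

# Hypothetical usage: score the same questions at each padding level
# with whatever model API you use, then plot accuracy vs. filler size.
prompt = pad_with_distractors(
    "What is 17 * 24?",
    distractors=["The Nile is the longest river in Africa.",
                 "Photosynthesis occurs in chloroplasts."],
    filler_words=50,
)
```

This would at least show whether benchmark scores degrade with context length the way the retrieval-style tasks in the paper do.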