I ran NoLiMa on Quasar Alpha (the stealth pre-release of GPT-4.1): https://news.ycombinator.com/item?id=43640166#43640790
Updated results from the authors: https://github.com/adobe-research/NoLiMa
GPT-4.1 is the best-known performer on this benchmark, but it still falls off quickly at even relatively modest context lengths (85% perf at 16K). (Cutting-edge reasoning models like Gemini 2.5 Pro haven't been evaluated because of their cost, and they might outperform it.)