https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o...
IMO this is the best long-context benchmark. Hopefully they will run it on the new models soon. Needle-in-a-haystack is useless at this point: Llama-4 had perfect needle-in-a-haystack results but horrible real-world performance.