The numbers they show don't matter. "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6." So what did Anthropic do? They just stopped showing that benchmark altogether and now show only the cherry-picked ones that improved.