You’re completely overrating these benchmarks, and it’s leading you to a nonsense opinion. Just actually use the models and you will see that the gap is significant.

It should be easy for a company like Anthropic to prove this beyond a doubt. Why don't they? Why don't they have a collection of prompts and side-by-side comparisons with other models showing how far ahead they are?

I think it's mainly because the difference between models at the frontier isn't "response to prompt X", but rather "coherence with 500K tokens of context and instructions in play".

Good morning to the Anthropic office, good sir.