I think it's mainly because the difference between models at the frontier isn't "response to prompt X", but rather "coherence with 500K tokens of context and instructions in play".