But Llama 4 Scout does poorly on long-context benchmarks despite its claimed 10M-token context window. It ranks one slot above Llama 3.1 8B in this one[1].
Indeed, but that doesn't take away from the fact that long context isn't trained on long content; it is obtained by scaling up short content instead.