Is there any evidence that GPT-4.1 is using RoPE to scale context?
Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.
I am not sure about public evidence, but the memory requirements alone for training on 1M-token windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE scaling is essential for long context anyway; you can't train it the "normal" way. Please see the paper I linked previously for more context (pun not intended) on RoPE.
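To make the idea concrete, here is a minimal sketch of one common RoPE scaling trick, linear position interpolation: positions beyond the training window are divided by a scale factor so they map back into the range the model saw during training. This is illustrative only (I'm assuming a Llama-style rotate_half convention; the function names and the 8k-to-32k numbers are made up for the example), not a claim about how any particular model does it.

```python
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Linear position interpolation: shrink positions by `scale` so a
    # longer context reuses the angle range seen during pretraining.
    pos = positions.float() / scale
    angles = torch.outer(pos, inv_freq)   # (seq_len, dim // 2)
    return angles.cos(), angles.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # x: (seq_len, dim) queries or keys; rotate_half convention as in Llama-style code.
    cos = torch.cat([cos, cos], dim=-1)
    sin = torch.cat([sin, sin], dim=-1)
    return x * cos + rotate_half(x) * sin

# Hypothetical example: a model trained on 8k positions, extended to 32k with scale=4.
cos, sin = rope_angles(torch.arange(32768), dim=128, scale=4.0)
q = torch.randn(32768, 128)
q_rotated = apply_rope(q, cos, sin)
```

The point is that only the angle computation changes; nothing about the attention math or the weights needs to be retrained from scratch, which is why this is so much cheaper than actually training on 1M-token sequences.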
Re: Llama 4, please see the sibling comment.