Is there any evidence that GPT-4.1 is using RoPE to scale context?
Also, I don't know about Qwen, but I know Llama 4 has severe performance issues, so I wouldn't use that as an example.
I am not sure about public evidence, but the memory requirements alone for training on 1M-token windows would make it a very unrealistic proposition compared to RoPE scaling. And as I mentioned, RoPE scaling is essential for long context anyway; you can't train it the "normal" way. Please see the paper I linked previously for more context (pun not intended) on RoPE.
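To make the idea concrete, here is a minimal sketch of one common RoPE scaling trick, linear position interpolation: positions beyond the training window are divided by a scale factor so they map back into the range the model saw during training. This is illustrative only (I'm assuming a Llama-style rotate_half convention; the function names and the 8k-to-32k numbers are made up for the example), not a claim about how any particular model does it.

```python
import torch

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/dim)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Linear position interpolation: shrink positions by `scale` so a
    # longer context reuses the angle range seen during pretraining.
    pos = positions.float() / scale
    angles = torch.outer(pos, inv_freq)   # (seq_len, dim // 2)
    return angles.cos(), angles.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x, cos, sin):
    # x: (seq_len, dim) queries or keys; rotate_half convention as in Llama-style code.
    cos = torch.cat([cos, cos], dim=-1)
    sin = torch.cat([sin, sin], dim=-1)
    return x * cos + rotate_half(x) * sin

# Hypothetical example: a model trained on 8k positions, extended to 32k with scale=4.
cos, sin = rope_angles(torch.arange(32768), dim=128, scale=4.0)
q = torch.randn(32768, 128)
q_rotated = apply_rope(q, cos, sin)
```

The point is that only the angle computation changes; nothing about the attention math or the weights needs to be retrained from scratch, which is why this is so much cheaper than actually training on 1M-token sequences.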
Re: Llama 4, please see the sibling comment.