Hacker News

When you read technical papers on various models, you’ll find that they often did most of the pretraining, and even the supervised fine-tuning, on relatively short-context data; then they “extended” the context window by training on a small amount of long-context data. I think this is what is meant by not being trained uniformly.

However, now that RL environments and long-horizon agentic performance have taken such a prominent role in model development, I wonder if that practice still holds. I know that the most recent Gemma and Qwen models are incomparably more reliable at long contexts than their predecessors, even though Qwen, for example, already had a 256k context window. It just didn’t work like it does now.