They absolutely are. The “maximum context window” of a model is a byproduct of the context length it was trained on.

If your model only ever sees 8K-token samples during training, it won't be as good at 128K context as if you had trained on samples ranging from 8K to 128K tokens.
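
To make that concrete, here's a minimal sketch of one way to mix context lengths during training. The log-uniform sampling and the specific bounds are assumptions for illustration, not a description of any particular model's recipe:

```python
import math
import random

def sample_context_length(min_len=8_192, max_len=131_072):
    """Pick a training sequence length between min_len and max_len.

    Log-uniform sampling (an assumed choice) keeps shorter contexts
    well represented instead of letting long ones dominate.
    """
    log_min, log_max = math.log2(min_len), math.log2(max_len)
    return int(2 ** random.uniform(log_min, log_max))

# Each training batch gets a length drawn from the full range,
# so the model sees both short and long contexts.
lengths = [sample_context_length() for _ in range(5)]
```

With a mixture like this, the model's attention patterns and positional encodings get exercised across the whole range it will see at inference time, rather than only at one fixed length.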