Thanks for your interest!

Not necessarily. While the held-out downstream evals showed that 1T-1S setups outperformed larger populations like 4T-4S or 8T-8S on some specific benchmarks, that does not invalidate the motivation for population-based training.

The main motivation for larger populations is more diversity in both problems and solutions, which can encourage specialization and broader task coverage. Even if that diversity does not improve scores on some of the particular benchmarks we used, it is still arguably a desirable property.

Figure 9 in the paper, for example, shows that students trained with larger populations are exposed to a much wider range of tasks than the baseline.

Also, when averaged across all the benchmarks we measure, 4v4 does perform best.

The “creating new population members in seconds” comment refers to operating in LoRA space: the mutation and crossover operators are applied to lightweight LoRA adapters rather than to full model weights, which makes the process very fast and memory-efficient.
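
As a rough, dependency-free sketch of why this is cheap, here is what mutation and crossover over adapter weights could look like. Everything specific in it is an illustrative assumption rather than the operators from the paper: the adapter layout (parameter name mapped to a flat list of floats), the Gaussian-noise mutation, the per-tensor interpolation crossover, and the names `make_adapter`, `mutate`, and `crossover`.

```python
import random
from typing import Dict, List

# Hypothetical adapter format: parameter name -> flat list of weights.
# Real LoRA adapters store low-rank A/B matrices per layer; flat lists
# are used here only to keep the sketch free of dependencies.
Adapter = Dict[str, List[float]]


def make_adapter(rng: random.Random, sizes: Dict[str, int]) -> Adapter:
    """Create a randomly initialised adapter with the given tensor sizes."""
    return {name: [rng.gauss(0.0, 0.02) for _ in range(n)]
            for name, n in sizes.items()}


def mutate(parent: Adapter, rng: random.Random, sigma: float = 0.01) -> Adapter:
    """Assumed mutation: add Gaussian noise to every adapter weight."""
    return {name: [w + rng.gauss(0.0, sigma) for w in weights]
            for name, weights in parent.items()}


def crossover(a: Adapter, b: Adapter, rng: random.Random) -> Adapter:
    """Assumed crossover: interpolate each tensor with a random coefficient."""
    child = {}
    for name in a:
        alpha = rng.random()  # mixing weight in [0, 1), drawn per tensor
        child[name] = [alpha * x + (1.0 - alpha) * y
                       for x, y in zip(a[name], b[name])]
    return child


rng = random.Random(0)
# Toy sizes: a rank-8 adapter on a 64-dimensional layer.
sizes = {"layer0.lora_A": 8 * 64, "layer0.lora_B": 64 * 8}
parent_1 = make_adapter(rng, sizes)
parent_2 = make_adapter(rng, sizes)

# A new population member: crossover of two parents, then mutation.
child = mutate(crossover(parent_1, parent_2, rng), rng)
```

Because both operators only touch the adapter tensors, the cost of creating a new member scales with the adapter size rather than with the size of the base model, which stays shared and untouched.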