> There are certain tasks, like improving a given program for speed, for instance, where in theory the model can continue to make progress with a very clear reward signal for a very long time.
Yup, this will absolutely be a big driver of gains in AI for coding in the near future. We actually built a benchmark based on this exact principle: https://algotune.io/