This is an interesting article in general, but this is the standout piece for me:

>For example, an agent optimized with Claude 3.5 Sonnet also showed improved performance when powered by o3-mini or Claude 3.7 Sonnet (left two panels in the figure below). This shows that the DGM discovers general agent design improvements rather than just model-specific tricks.

This demonstrates a technique whereby a smaller/older/cheaper model is used to improve the output of a larger model. This is backwards (as far as I understand): the current SOTA technique typically sees enormous/expensive models training smaller, cheaper models.

If that's a generalisable result, end-users should be able to drive down their own inference costs pretty substantially.

> This demonstrates a technique whereby a smaller/older/cheaper model is used to improve the output of a larger model. This is backwards (as far as I understand): the current SOTA technique typically sees enormous/expensive models training smaller, cheaper models.

There are two separate aspects here. In this paper they improve the software around the model, not the model itself. What they're saying is that the software improvements carried over to other models, so it wasn't just optimising around model-specific quirks.
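
To make the distinction concrete, here's a minimal sketch (not the DGM's actual code) of where the "software around the model" lives; `call_model` and `run_agent` are hypothetical names standing in for whatever client and agent loop a real harness uses:

```python
def call_model(model: str, prompt: str) -> str:
    """Stand-in for a chat-completion API call; swap in any provider's client."""
    return f"[{model}] worked on: {prompt[:40]}... DONE"  # dummy reply so the sketch runs

def run_agent(model: str, task: str, max_steps: int = 10) -> str:
    """The 'harness': prompt construction, history management, stopping logic,
    and (in a real agent) tool parsing and retries. None of this touches model
    weights, so improvements here can carry over when `model` is swapped out."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = call_model(model, "\n".join(history))
        history.append(reply)
        if "DONE" in reply:  # crude completion heuristic
            return reply
    return history[-1]
```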

What you're describing, where a large LLM is trained first and then used to teach a smaller one, is usually called "distillation". It works by training the smaller LLM to match the teacher's entire output distribution over the vocabulary at each position, not just a single sampled token (hence it's faster in practice).
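
For concreteness, here's a minimal PyTorch sketch of the standard soft-target distillation loss (Hinton et al., 2015); the temperature and reduction choices are illustrative defaults, not anything from this paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target distillation: push the student's per-position distribution
    over the whole vocabulary toward the teacher's, instead of training on a
    single hard token label per position."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t**2 factor keeps gradient scale comparable
    # across temperatures, as in Hinton et al.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t ** 2
```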

I think it's different from improving the model weights themselves, as in the distillation examples you're mentioning. The point is that changes to the "harness", the code running around the LLM calls (which is what the DGM edits), persist or generalize when wrapping more powerful LLMs. That means they aren't all wasted when a more powerful LLM comes along that the harness wasn't tuned for.
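
Continuing the hypothetical `run_agent` sketch above, the payoff would look like reusing the same optimized harness, unchanged, as stronger models arrive:

```python
# Same harness, different backing models (names are just the ones
# discussed in this thread):
for model in ["claude-3-5-sonnet", "o3-mini", "claude-3-7-sonnet"]:
    print(run_agent(model, "fix the failing test"))
```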