Instead of driving the agent with an LLM, might it work to have the agent hard-code heuristics and benchmark them in some kind of simulation, then feed the results back to the agent so it can improve the heuristics?
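
To make the idea concrete, here is a rough sketch of that loop in Python. Everything in it is made up for illustration: `simulate` is a toy world with a hidden cutoff of 0.7, the "heuristic" is a single threshold, and `propose_revision` is only a stand-in for the agent (it nudges the threshold at random, where a real agent would presumably read the benchmark results and rewrite the heuristic code).

```python
import random
from typing import Tuple


def simulate(threshold: float, episodes: int = 200, seed: int = 0) -> float:
    """Toy benchmark (hypothetical): fraction of episodes the heuristic gets right."""
    rng = random.Random(seed)  # fixed seed, so every candidate sees the same episodes
    wins = 0
    for _ in range(episodes):
        signal = rng.random()       # what the heuristic observes
        should_act = signal > 0.7   # hidden ground truth of the toy world
        acts = signal > threshold   # the hard-coded heuristic under test
        wins += acts == should_act
    return wins / episodes


def propose_revision(threshold: float, score: float, rng: random.Random) -> float:
    """Stand-in for the agent: a random nudge that shrinks as the score improves."""
    step = (1.0 - score) * 0.5 + 0.01
    return min(1.0, max(0.0, threshold + rng.uniform(-step, step)))


def improve(rounds: int = 50, seed: int = 1) -> Tuple[float, float]:
    """Benchmark, feed the score back, and keep a revision only if it scores higher."""
    rng = random.Random(seed)
    best = 0.2  # arbitrary starting heuristic
    best_score = simulate(best)
    for _ in range(rounds):
        candidate = propose_revision(best, best_score, rng)
        score = simulate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    threshold, score = improve()
    print(f"threshold={threshold:.3f} score={score:.3f}")
```

The fixed seed in `simulate` is a deliberate choice in this sketch: every candidate is scored on the same episodes, so a higher score reflects the heuristic rather than luck in the draw.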