i actually do it differently

> (1) Let the LLM randomly perturbate the system.

instead of this i ask LLM to what's least likely to improve performance and then measure it.

sometimes big gains come from places you thought are least likely.

For sure! The hypothesis generation gotta be improved. Your take on the "least likely" is interesting. In the beginning of the repo I was having problems with "hypothesis convergence", your idea may be a nice way to introduce the much needed variability