My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.