Thanks for commenting! Actually, in this case "the work being done" can be really fast because it can be done asynchronously. For context, here’s how this translates to a real-world application.
The original algorithm was provided by DeepSeek, and our optimized implementation achieves a 92x speedup over it. The 5x number compares against a different baseline that has not been disclosed yet.
When integrating EPLB into vLLM, I discovered, somewhat unexpectedly, that the open-source algorithm consumes nearly half of the total time of a rearrangement step, with the remaining time spent transferring weights across GPUs. To address this, I applied OpenEvolve to the algorithm, setting the primary objective to improve speed while maintaining the same balance factor. It performed remarkably well. With additional optimizations to the weight transfer, the overall overhead has now become almost negligible.
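To make "improve speed while maintaining the same balance factor" concrete, here is a minimal sketch of how such an objective could be scored. Everything in it is an assumption for illustration: the balance factor is taken to be max per-GPU load over mean load, `greedy_placement` is a hypothetical stand-in rather than the EPLB algorithm, and `score` is not OpenEvolve's evaluator API.

```python
import time
from statistics import mean


def balance_factor(gpu_loads):
    """Assumed definition: max per-GPU load / mean load (1.0 = perfectly balanced)."""
    return max(gpu_loads) / mean(gpu_loads)


def greedy_placement(expert_loads, num_gpus):
    """Hypothetical stand-in: place the heaviest expert on the least-loaded GPU."""
    loads = [0.0] * num_gpus
    for load in sorted(expert_loads, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += load
    return loads


def score(candidate, baseline, expert_loads, num_gpus, tol=1e-6):
    """Speedup of candidate over baseline, or 0.0 if the candidate balances worse."""
    start = time.perf_counter()
    base_loads = baseline(expert_loads, num_gpus)
    base_time = time.perf_counter() - start

    start = time.perf_counter()
    cand_loads = candidate(expert_loads, num_gpus)
    cand_time = time.perf_counter() - start

    # Hard constraint first: a faster but less balanced candidate gets no credit.
    if balance_factor(cand_loads) > balance_factor(base_loads) + tol:
        return 0.0
    return base_time / max(cand_time, 1e-12)
```

Treating balance as a hard constraint rather than a weighted term keeps the search from trading balance quality for speed; a real evaluator would also average timings over many runs and workloads.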
While no one will deny you (or I guess your system) the immense satisfaction of a 100x improvement on a given step, I think it would be helpful to note the frequency of this rebalancing step, and to contextualize your result in terms of the runtime (or throughput) of the workload(s) you were using to evaluate.
e: also, a comparison against a fixed (nothing faster than 0!) and a random policy might be informative if your intent is to publish this as an improvement on the object problem, not just a demonstration of ARDS.
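To make the frequency point concrete, here is a rough back-of-the-envelope sketch. The 0.5 fraction and the 92x factor come from the numbers quoted upthread; the rebalance duration, interval, and step time are hypothetical placeholders, and the model assumes the rebalance blocks the workload, which is a worst case if it actually runs asynchronously.

```python
def amdahl_speedup(fraction_accelerated, local_speedup):
    """Overall speedup when only part of a step gets faster (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / local_speedup)


def overhead_fraction(rebalance_seconds, steps_between_rebalances, step_seconds):
    """Share of wall time spent rebalancing, assuming it blocks the workload."""
    useful = steps_between_rebalances * step_seconds
    return rebalance_seconds / (useful + rebalance_seconds)


# Algorithm is ~half of a rearrangement step and gets 92x faster:
# the whole step speeds up by just under 2x.
step_speedup = amdahl_speedup(0.5, 92.0)

# Hypothetical workload: a 1 s rebalance every 1000 steps of 50 ms each.
before = overhead_fraction(1.0, 1000, 0.05)
after = overhead_fraction(1.0 / step_speedup, 1000, 0.05)

print(f"step speedup: {step_speedup:.2f}x")
print(f"overhead before: {before:.2%}, after: {after:.2%}")
```

Plugging in the real rebalance interval and step time is what would show whether the win matters for end-to-end throughput.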