So, if I got this right, this is just about re-implementing an existing load balancing algorithm faster...? If so, this is really dumb. As you guys checked out, yes most load balancing algorithms are slow/dumb:

>First, we evaluate DeepSeek's open-source EPLB implementation. This employs a greedy bin-packing strategy: experts are sorted by load in descending order, and each is placed onto the least-loaded GPU that has capacity (Figure 3a, Example 1). While simple, the solution is slow because it written in Python and uses a for-loop to performs linear search for finding the best-fit GPU choice.

This is because when considering a load balancing algorithm, unless the work being done (in this case by the GPU) lasts only a few ms, the load balancing algorithm being fast will never be the bottleneck. The post does not mention whether this is the case at all.

Also, I don't want to sound rude, but if all they managed to get is a 5x increase over a simple python algorithm, I don't think this is impressive at all...? Any rewrite of the 'dumb' algorithm in a language with more memory control and cache continuity should result in much better results.

Thanks for commenting! Actually in this case, "the work being done" can be really fast because it can be done asynchronously. For context, here’s how this translates in a real-world application.

The original algorithm was provided by DeepSeek, and our optimized implementation achieves a 92× speedup over it. The 5x number is comparing with another baseline that is undisclosed yet.

When integrating EPLB into vLLM, I discovered—somewhat unexpectedly—that the open-source algorithm consumes nearly half of the total time of a rearrangement step, with the remaining time spent transferring weights across GPUs. To address this, I applied OpenEvolve to the algorithm, setting the primary objective to improve speed while maintaining the same balance factor. It performed remarkably well. With additional optimizations on the weight transferring, the overall overhead has now become almost negligible.

While no one will deny you (or I guess your system) the immense satisfaction of 100x improvement on a given step, I think it would be helpful to note the frequency of this rebalancing step, and to contextualize your result in terms of the runtime (or throughput) of the workload(s) you were using to evaluate.

e: also comparison a fixed (nothing faster than 0!) and random policy might be informative if your intent is to publish this as improvement for the object problem, not just a demonstration of ARDS.

Agree. Starting from Python for-loops is embarrassing baseline. Any decent implementation gets you most of that 5x for free. The interesting part isn't the speedup - it's that AI can do routine optimization unsupervised. That's the actual value prop.