This is quite cool, but I must note that the 5x reported in the headline is the _runtime_ of the load balancing algorithm itself, not the load factor or throughput of the system or what have you.
> On average, it takes about 540 ms to re-balance the experts and achieves a load balance factor of 0.66 (calculated as the ratio of average to maximum tokens generated per GPU).
> ...
> We also consider a non-public reference implementation from a frontier lab that we have access to. This implementation avoids explicit iteration and reduces the rebalancing algorithm runtime to 19.6 ms while achieving the same balance factor as the open-source algorithm.
> ...
> The resulting algorithm matches the load balance factor of the other baselines while reducing runtime to just 3.7 ms, yielding a 5.0x speedup over the internal reference implementation.
That's a good point! The load balancing of the original algorithm was already quite good so our goal was to try to get something that could achieve similar results but could run faster since runtime was also a concern.