Hacker News

Yeah, that can work. Just yesterday I benchmarked load balancing of LLM workloads across 2 GPUs using a simple least_conn from nginx. The total token/sec scaled as expected (2 GPUs => 2x token/sec), and GPU utilization reached 100% on both, as I increased concurrency from 1 to 128 simultaneous generations.