Hmm, not sure I entirely agree with you there. You're right that there are queues to ensure you max out the hardware with concurrent batches to _start_ inference, but I doubt you'd want to split the same job into multiple pieces and move them between servers if you could at all avoid it.

Moving state like that takes a lot of bandwidth: even at 400 Gbit/s (roughly 50 GB/s), a KV cache of a few tens of gigabytes takes the better part of a second to transfer between racks, even within the same DC.
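Back-of-envelope sketch of that transfer-time claim (the cache sizes here are illustrative assumptions, not measurements):

```python
def transfer_time_s(cache_size_gb: float, link_gbit_per_s: float = 400.0) -> float:
    """Ideal wire time to move a KV cache over a link, ignoring
    protocol overhead, serialization, and congestion."""
    # gigabytes -> gigabits, then divide by link speed in Gbit/s
    return cache_size_gb * 8.0 / link_gbit_per_s

# Hypothetical cache sizes at 400 Gbit/s (~50 GB/s):
print(transfer_time_s(10))   # 10 GB  -> 0.2 s
print(transfer_time_s(50))   # 50 GB  -> 1.0 s
```

So even under ideal conditions, a tens-of-gigabytes cache costs on the order of a second of pure wire time, before any real-world overhead.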