> In-kernel implementations and/or io-uring may improve this unfortunate situation, but today in practice it's hard to achieve the same throughput as with plain TCP.
This would depend on how the server application is written, no? Using io-uring and similar should minimise context-switches from userspace to kernel space.
> I also vaguely remember that QUIC makes load-balancing more challenging for ISPs, since they can not distinguish individual streams as with TCP.
Not just for ISPs; IIRC (and I may be recalling incorrectly) reverse proxies can't currently distinguish either, so you can't easily put an application behind Nginx and use it as a load-balancer.
The server application itself has to be the proxy if you want to scale out. OTOH, if your proxy for UDP is able to inspect the packet and determine the corresponding instance to send a UDP packet too, it's going to be much fewer resources required on the reverse proxy/load balancer, as they don't have to maintain open connections at all.
It will also allow some things more easily; a machine that is getting overloaded can hand-off (in userspace) existing streams to a freshly created instance of the server on a different machine, because the "stream" is simply related UDP packets. TCP is much harder to hand-off, and even if you can, it requires either networking changes or kernel functions to hand-off.