The main difference with io_uring is that you don't block the thread, just as with O_NONBLOCK + epoll, but you don't have to pay a syscall per operation to do so: there's no expensive context switch into kernel mode for each submission. Using O_NONBLOCK + epoll is already async :)
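To make the comparison concrete, here is a minimal sketch of the O_NONBLOCK + epoll pattern using Python's `select.epoll` wrapper (Linux only; the socketpair stands in for a real network connection and the names are illustrative):

```python
import select
import socket

# A socketpair stands in for a real network connection.
a, b = socket.socketpair()
a.setblocking(False)  # same effect as setting O_NONBLOCK on the fd
b.setblocking(False)

ep = select.epoll()
ep.register(a.fileno(), select.EPOLLIN)  # notify us when `a` is readable

b.send(b"ping")  # make `a` readable

# epoll_wait() is the only place we block; the recv itself never does.
data = None
for fd, mask in ep.poll(timeout=1):
    if fd == a.fileno() and mask & select.EPOLLIN:
        data = a.recv(4)

print(data)  # b'ping'
ep.close()
a.close()
b.close()
```

The point is that the readiness notification (epoll) and the actual I/O (recv) are separate steps, each a syscall; io_uring folds submission and completion into shared rings so many operations can go through a single syscall, or none at all with SQPOLL.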

In fact, even with regular blocking calls you don't know when a syscall actually starts executing. The only thing you're sure of is that the kernel "knows" about the syscall you want; you have no indication of whether it has started to run yet.

The real question is: are the classical measurements accurate? All we have is an upper bound on the time an operation took: I fired the write at t0 and finished reading the response at t1. This doesn't really change with io_uring. Batching mostly changes one thing: multiple measurements will share a t0, and possibly a t1 when multiple completions arrive at once.
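That upper bound is just a pair of timestamps bracketing the round trip; a hedged sketch of the idea (the echoing "server" side is simulated with a socketpair for illustration):

```python
import socket
import time

a, b = socket.socketpair()

t0 = time.monotonic_ns()   # taken just before firing the write
a.send(b"req")
b.send(b.recv(3))          # the "server" echoes the request back
resp = a.recv(3)
t1 = time.monotonic_ns()   # taken after the response is fully read

# t1 - t0 is only an UPPER BOUND on the operation's latency: it also
# includes our own userspace overhead on both sides of the syscalls.
upper_bound_ns = t1 - t0
a.close()
b.close()
```

Nothing in this measurement tells you when the kernel actually started servicing the write, which is exactly the ambiguity described above, with or without io_uring.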

Is it important? Yes and no. What matters most in such benchmarks is that the added delay is consistent between measurements, and knowing when that consistency starts to break down. So it matters if you're chasing every µs in the stack, but not if your goal is lowering the p99, which shows up under heavy load. In that case, consistency between measurements is paramount so that the resulting histograms and percentiles make sense.
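A toy illustration of why consistency is what matters: a constant per-measurement overhead shifts every percentile by the same amount, so relative comparisons survive it (the numbers below are made up, and the `p99` helper is a simple nearest-rank sketch, not a benchmarking library):

```python
# Hypothetical latency samples in µs, plus a constant measurement cost.
true_latencies = [100, 110, 120, 500]
overhead = 7
measured = [t + overhead for t in true_latencies]

def p99(samples):
    """Nearest-rank p99 over a list of samples (illustrative only)."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(0.99 * len(s)))]

# The constant offset cancels out of the difference between percentiles.
print(p99(measured) - p99(true_latencies))  # 7
```

An overhead that varies with load, by contrast, would distort the shape of the histogram itself, which is exactly the failure mode to watch for.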