Interesting, I made something similar years ago when io_uring wasn't around yet and it is just a couple threads blocking on sendfile: https://github.com/evelance/sockbiter
Of course, it needs to pre-generate the file, and you need enough RAM for both the running server and the cached file, but it needs almost zero CPU during the test run and can probably produce even more load than this io_uring tool.
What is the point of making up claims of "extreme" performance without any accompanying benchmarks or comparisons?
It really should be shameful to use unqualified adjectives in headline claims without also providing the supporting evidence.
did you scroll down?
I did, and I still didn't see any numbers. Just a bunch of AI-generated text about why it's supposedly fast. It even says it records numbers multiple times, so why aren't any presented?
Really stupid question from someone who doesn't know much about io_uring: wouldn't doing all this I/O async make the latency measurements less accurate? How do you know when the I/O starts if you are submitting it async in batches of 2048?
The main difference with io_uring is that you're not blocking the thread, just as with O_NONBLOCK + epoll, but you don't have to issue a syscall per operation to achieve that: there's no expensive context switch into kernel mode. Using O_NONBLOCK + epoll is already async :)
In fact, you don't know when a syscall actually starts executing even with regular blocking calls. The only thing you're sure of is that the kernel "knows" about the syscall you want; you have absolutely no indication of whether it has started to run or not.
The real question is: are the classical measurements accurate in the first place? All we ever have is an upper bound on the time it took: I fired the write at t0 and finished reading the response at t1. This doesn't really change with io_uring. Batching mostly changes one thing: multiple measurements will share a t0, and possibly a t1 when multiple replies arrive at once.
Does it matter? Yes and no. The most important property of such benchmarks is that the added delay is consistent between measurements, and that you can see where it starts to break down. So it matters if you're chasing every µs in the stack, but not if your goal is lowering the p99, which degrades under heavy load. In that case, consistency between measurements is paramount so that the resulting histograms actually make sense.
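To make the shared-t0 point concrete, here's a toy sketch (my own illustration, not this tool's code): every request in a batch shares one submission timestamp t0, while each completion records its own t1, so the measured t1 - t0 is an upper bound that also includes time spent queued behind the rest of the batch.

```python
import time

def run_batch(payloads, do_io):
    """Measure per-request latency when requests are submitted as one batch."""
    t0 = time.monotonic()            # one t0 shared by the whole batch
    latencies = []
    for _ in do_io(payloads):        # completions may arrive one by one
        t1 = time.monotonic()        # per-completion t1
        latencies.append(t1 - t0)    # upper bound: includes batch queueing
    return latencies

# Example with a fake I/O backend that "completes" requests in order:
lats = run_batch([b"a", b"b", b"c"], lambda reqs: iter(reqs))
```

The key consequence: within one batch, a request that completes later can never show a smaller latency than an earlier one, since they all measure from the same t0.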
It's not a stupid question.
Normally, when I've run latency measurements in the past, I've run them from the perspective of the caller, not the server.
In most cases this is over the network, a named pipe, or a socket file.
I guess it should be possible to run multiple runtimes inside a program that run independently.
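As a minimal sketch of what I mean by caller-side measurement (my own example, over a local TCP socket): the timestamps are taken in the client, just before firing the request and just after the full response is read, so the server's internals never enter into it.

```python
import socket
import threading
import time

def echo_server(srv):
    # Trivial one-shot echo server standing in for the system under test.
    conn, _ = srv.accept()
    conn.sendall(conn.recv(64))
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))           # ephemeral port on loopback
srv.listen(1)
threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
t0 = time.monotonic()                # caller-side: just before the write
cli.sendall(b"ping")
reply = cli.recv(64)                 # blocks until the response arrives
t1 = time.monotonic()                # caller-side: after reading the reply
latency = t1 - t0                    # upper bound on server-side latency
cli.close()
```

The same pattern works over a named pipe or a Unix socket file; only the transport changes.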