How many systems are there that can't just spawn a thread for each task they have to work on concurrently? This has to be a system that is A) CPU or memory bound (since async doesn't make disk or network IO faster) and B) must work on ~tens of thousands of tasks concurrently, i.e. can't just queue up tasks and work on only a small number concurrently. The only meaningful examples I can come up with are load balancers, embedded software and perhaps something like browsers. But e.g. an application server implementing a REST API that needs to talk to a database anyway to answer each request doesn't really qualify, since the database connection and the work the database itself does are likely much more resource intensive than the overhead of a thread.
I'm not sure this is the correct mental model of what async solves.
Async precisely improves disk/network I/O-bound applications, because synchronous code has to waste a whole thread sitting around waiting for an I/O response (each with its own stack memory and scheduler overhead), and in something like an application server there will be many incoming requests doing so in parallel. Cancellation is also easier with async.
CPU-bound code would not benefit, because the CPU is already busy and async adds overhead.
See e.g. https://learn.microsoft.com/en-us/aspnet/web-forms/overview/... and https://learn.microsoft.com/en-us/aspnet/web-forms/overview/...
I have some test code that runs a comparison of Hyper pre-async (aka thread per request) vs async (via Tokio), and the pre-async version is able to process more requests per second in every scenario (I/O, CPU-heavy tasks, shared memory).
I'll publish my results shortly. I did these as baselines because I'm testing a completed version of the User Managed Concurrency Groups proposal for the Linux kernel, which is an extension to provide faster kernel threads (which beat both of them).
Relevant prior work: https://github.com/jimblandy/context-switch
Thank you for this! This is really helpful.
The UMCG implementation allows kernel thread context switches to happen in 150-200 nanoseconds, compared to the 1500-2000 nanoseconds for normal kernel thread context switches. My goal is to show that if UMCG could be merged into the Linux kernel then it would be competitive with async Rust without the headache.
How many concurrent requests?
I'll have to check my work computer on Monday. It was an 8-CPU virtual machine on an M1 Mac. The UMCG and normal-thread servers were set to 1024 threads; the Tokio version used 2 threads per core. Off the top of my head, the I/O-bound requests topped out around 40k/second for the Tokio version, 60k/second for the normal Hyper version, and 80k/second for the UMCG Hyper version.
I'm pretty close to being done - I'm hoping to publish the entire GitHub repository with tests for the community to validate by next week.
UMCG is essentially an open source version of Google Fibers, which is their internal extension to the Linux kernel for "lightweight" threads. It requires you to build a user space scheduler, but that allows you to create different types of schedulers. I can't remember which scheduler produced the results above, but I have at least 6 different UMCG schedulers I was testing.
So essentially you get the benefits of something like Tokio, where you can have different types of schedulers optimized for different use cases, plus the power of kernel threads, which means easy cancellation and easy programming (at least in Rust). It's still a Linux thread with an entire 8 MB(?) stack, but from my testing it's far faster than what Tokio can provide, without the headache of async/await programming.
Async only exists because languages like Python and JavaScript have global interpreter locks or single-threaded runtimes that don't play nice with threads.
Using async for languages like Rust or C++ is cargo cult by people who don't know what the hell they're doing.
[Caveat: there's a use case for async if you're doing embedded development where you don't have threads or call stacks at all.]
I read this argument ("async is for I/O-bound applications") often, but it makes no sense to me. If your app is I/O bound, how does reducing the work the (already idling!) CPU has to spend on context switching improve the performance of the system?
IO bound might mean latency but not throughput, so you can up concurrency and add batching, both of which require more concurrent requests in flight to hit your real limit. IO bound might also really mean contention for latches on the database, and different types of requests might hit different tables. Basically, I see people say they're IO bound long before they're at the limit of a single disk, so obviously they are not IO bound. Modern drives are absurdly fast. If everyone were really IO bound, we'd need 1/1000 the hardware we needed 10-15 years ago.
It sounds like you're assuming both pieces are running on the same server, which may not be the case (and if you're bottlenecked on the database it probably shouldn't be, because you'd want to move that work off the struggling database server)
Assuming for the sake of argument that they are together, you're still saving stack memory for every thread that isn't created. In fact you could say it allows the CPU to be idle, by spending less time context switching. On top of that, async/await is a perfect fit for OS overlapped I/O mechanisms for similar reasons, namely not requiring a separate blocking thread for every pending I/O (see e.g. https://en.wikipedia.org/wiki/Overlapped_I/O, https://stackoverflow.com/a/5283082)
Right, I think the argument should be that transitioning from a synchronous to asynchronous programming model can improve the performance of a previously CPU/Memory-bound system so that it saturates the IO interface.
If the system is CPU-bound doing useful work, that's not the case. Async shines when there are a lot of "tasks" that are not doing useful work, because they are waiting (e.g. on I/O). Waiting threads waste resources. That's what async greatly improves.
The simplest example is that you can easily be wasteful in your use of threads. If you just write blocking code, you will block the thread while waiting on I/O, and threads are a finite resource.
So avoiding that would mean a server can handle more traffic before running into limits based on thread count.
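To make the "waiting tasks don't need a thread each" point concrete, here is a rough Python sketch using only the standard library. The names fake_request, serve and run_demo are made up for illustration, and asyncio.sleep stands in for a real network or disk wait, so treat it as a toy under those assumptions, not a benchmark.

```python
import asyncio
import threading
import time


async def fake_request(i: int) -> int:
    # asyncio.sleep stands in for a network or disk wait (an assumption for
    # illustration); the task is parked without holding an OS thread.
    await asyncio.sleep(0.1)
    return i


async def serve(n: int) -> tuple[int, int]:
    baseline = threading.active_count()
    tasks = [asyncio.create_task(fake_request(i)) for i in range(n)]
    # Yield once so every task starts and parks on its sleep.
    await asyncio.sleep(0)
    extra_threads = threading.active_count() - baseline
    results = await asyncio.gather(*tasks)
    return len(results), extra_threads


def run_demo(n: int = 5000) -> tuple[int, int, float]:
    start = time.perf_counter()
    done, extra_threads = asyncio.run(serve(n))
    return done, extra_threads, time.perf_counter() - start


if __name__ == "__main__":
    done, extra, elapsed = run_demo()
    print(f"{done} waits finished in {elapsed:.2f}s with {extra} extra threads")
```

All the waits overlap on the one thread that runs the event loop, so the thread count stops being the limit on how many requests can be in flight. Whether that matters for a given server is exactly what the rest of this thread is arguing about.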
Inversion of thought pattern: Why is a thread such a waste that we can't have one per concurrent request? Make threads less wasteful instead. Go took things in this direction.
How do you suggest we just "make threads less wasteful"?
I mean, I suppose we could move the scheduling and tracking out of kernel mode and into user mode...
But then guess what we've just reinvented?
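For anyone who hasn't seen it spelled out, "scheduling in user mode" can be as small as the Python sketch below: a round-robin loop over generators, where each yield is a cooperative switch that never enters the kernel. The names worker, run_round_robin and demo are hypothetical, and this is only a toy, not how Go, Tokio or UMCG actually schedule.

```python
from collections import deque
from typing import Generator

Task = Generator[None, None, None]


def worker(name: str, steps: int, log: list[str]) -> Task:
    for step in range(steps):
        log.append(f"{name}:{step}")
        # Yielding hands control back to the scheduler: a user-mode
        # "context switch" with no kernel involvement.
        yield


def run_round_robin(tasks: list[Task]) -> None:
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # run the task until its next yield
        except StopIteration:
            continue            # task finished, drop it
        ready.append(task)      # otherwise put it at the back of the queue


def demo() -> list[str]:
    log: list[str] = []
    run_round_robin([worker("a", 2, log), worker("b", 3, log)])
    return log
```

Calling demo() interleaves the two workers step by step. Add an I/O readiness check to the loop and you have the core of an async runtime, which is the point being made above.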
Pretty much anything that needs performance and has a lot of relatively light operations is not a candidate for spawning a thread. Context switching and the cost of threads is going to kill performance. A server spawning a thread per request for relatively lightweight requests is going to be extremely slow. But sure, if every REST call results in a 10s database query then that's not your bottleneck. A query to a database can be very fast though (due to caches, indices, etc.), so it's not a given that just because you're talking to a database you can just spin up new threads and it'll be fine.
EDIT: Something else to consider is what happens if your REST call needs to make 5 queries. Do you serialize them? Now your latency can be worse. Do you launch a thread per query? Now you need to a) synchronize and b) pay 5x the thread cost. Async patterns or green threads or coroutines enable more efficient overlapping of operations and potentially better concurrency (though a server that handles lots of concurrent requests may already have "enough" concurrency anyway).
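A small Python sketch of the 5-query case, assuming a made-up run_query coroutine that just sleeps for an assumed 100 ms round trip (QUERY_LATENCY, handle_serial, handle_overlapped and timed are illustrative names, not a real database API):

```python
import asyncio
import time

QUERY_LATENCY = 0.1  # assumed round-trip time per query, in seconds


async def run_query(n: int) -> str:
    # Stand-in for a real database call.
    await asyncio.sleep(QUERY_LATENCY)
    return f"row-{n}"


async def handle_serial() -> list[str]:
    # One query at a time: latency is roughly 5 x QUERY_LATENCY.
    return [await run_query(n) for n in range(5)]


async def handle_overlapped() -> list[str]:
    # All five in flight at once on the same thread: latency is roughly
    # 1 x QUERY_LATENCY, and gather returns results in submission order.
    return list(await asyncio.gather(*(run_query(n) for n in range(5))))


def timed(handler) -> tuple[list[str], float]:
    start = time.perf_counter()
    rows = asyncio.run(handler())
    return rows, time.perf_counter() - start
```

Both handlers return the same rows. The overlapped one gets there without five threads and without explicit synchronization, since gather collects the results for you.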
Server applications don’t spawn threads per request, they use thread pools. The extra context switching due to threads waiting for I/O is negligible in practice for most applications. Asynchronous I/O becomes important when the number of simultaneous requests approaches the number of threads you can have on your system. Many applications don’t come close to that in practice.
There’s a benefit in being able to code the handling of a request in synchronous logic. A case has to be made for the particular application that it would cause performance or resource issues, before opting for asynchronous code that adds more complexity.
Thread pools are another variation on the theme. But if your threads block then your pool saturates and you can't process any more requests. So thread pools still need non-blocking operations to be efficient or you need more threads. If you have thread pools you also need a way of communicating with that pool. Maybe that exists in the framework and you don't worry about it as a developer. If you are managing a pool of threads then there's a fair amount of complexity to deal with.
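The saturation effect is easy to reproduce with Python's standard ThreadPoolExecutor. In this sketch blocking_request and run_pool are hypothetical names, and time.sleep stands in for a blocking I/O call:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_request(i: int) -> int:
    # The worker thread is held for the whole wait and can do nothing else.
    time.sleep(0.1)
    return i


def run_pool(workers: int, requests: int) -> tuple[list[int], float]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order; requests beyond `workers` sit in the
        # pool's queue until a thread frees up.
        results = list(pool.map(blocking_request, range(requests)))
    return results, time.perf_counter() - start
```

With 8 requests, run_pool(2, 8) needs about four rounds of waiting while run_pool(8, 8) needs about one. Once every worker is blocked, the rest of the requests just queue, so you either add threads or make the waits non-blocking.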
I totally agree there are applications for which this is overkill and adds complexity. It's just a tool in the toolbox. Video games famously are just a single thread/main loop kind of application.
There’s also a really good operational benefit if you have limits like total RAM, database connections, etc. where being able to reason about resource usage is important. I’ve seen multiple async apps struggle with things like that because async makes it harder to reason about when resources are released.
Could you point out the issue here?
Why does async make it harder to reason about when resources are released?
Basically it’s the non-linear execution flow creating situations which are harder to reason about. Here’s an example I’m trying to help a Node team fix right now: something is blocking the main loop long enough that some of the API calls made in various places are timing out or getting auth errors, because the signature expires between when the request is prepared and when it is actually dispatched. That gap is sporadically tens of seconds instead of milliseconds. Because it’s all async calls, there are hundreds of places which have to be checked, whereas if it were threaded this class of error either wouldn’t be possible or would be limited to the same thread or to an explicit synchronization primitive, such as a concurrency limit on the number of simultaneous HTTP requests to a given target. Also, the call stack and other context are unhelpful until you put effort into observability for everything, because you need to know what happened between hitting await and the exception deep in code which doesn’t share a call stack.
Because async usually means you've stopped having the "call stack" as a useful abstraction.
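The failure mode described above (a request is prepared, then something else hogs the loop before it is dispatched) can be reproduced in a few lines. This is a Python asyncio sketch rather than Node, and signed_call, loop_hog and main are made-up names, so it only illustrates the mechanism, not the team's actual code:

```python
import asyncio
import time


async def signed_call() -> float:
    prepared_at = time.monotonic()  # the request is "signed" here
    await asyncio.sleep(0)          # any await lets other tasks run first
    # By the time we resume and "dispatch", the signature may have aged.
    return time.monotonic() - prepared_at


async def loop_hog(seconds: float) -> None:
    # A blocking call inside a coroutine stalls every task on the loop.
    time.sleep(seconds)


async def main(block_for: float) -> float:
    call = asyncio.create_task(signed_call())
    hog = asyncio.create_task(loop_hog(block_for))
    delay = await call
    await hog
    return delay
```

asyncio.run(main(0.2)) reports a prepare-to-dispatch delay of at least the time the hog blocked, even though signed_call itself contains nothing slow. The culprit can be any coroutine anywhere in the program, which is why there are so many places to check.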
> Context switching
No such thing. In a preemptive multitasking OS (that's basically all of them today) you will get context switching regardless of what you do. Most modern OS's don't even give you the tools to mess with the scheduler at all; the scheduler knows best.
That's not accurate. Preemptive multitasking just means your thread will get preempted. Blocking still incurs additional context switching. The core your thread is running on isn't just going to sit idle while your thread blocks.
Async does make nvme io faster because queueing multiple operations on the nvme itself is faster.
This is outside of my expertise, but wouldn't multiple threads each submitting a single operation in parallel have the same effect?
That is still “async” considering what gp wrote.
Because they wrote “thread per task” which I assume to mean something like “each os thread handles the work submitted by one user”.
This is beside the point but, something like io_uring is still significantly better than doing threadpool nvme io.
I agree: fork is fast, cheap and easy. If you're spawning something for significant work it tends to be in the noise.
Linux kernel uses 8k stacks (TBH, it's been a while), but there's also some copy-on-write overhead. Still, this is not the C10k problem...
I think it's another case of the whole industry being driven by the needs of the very small number of systems that need to handle >10k concurrent requests.
Or biases inherited from deploying on single or dual core 32-bit systems from 20 years ago.
Honestly, it's a mostly obsolete approach. OS threads are fast. We have lots of cores. The cost of bouncing around on the same core and losing L1 cache locality is higher than the cost of firing up a new OS thread that could land on a new core.
The kernel scheduler gets tuned. Language specific async runtimes are unlikely to see so many eyeballs.