> Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away

Not really. I’ve observed that async code is often written in a way that doesn’t maximize how much concurrency can be expressed (e.g. instead of “here are N I/O operations, run them all concurrently” it’s “for each operation x, await process(x)”). However, in a threaded world this problem gets worse because you have no way to optimize towards such concurrency - threads are inherently and inescapably too heavyweight to express concurrency in an efficient way.
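To make that concrete, a minimal Rust sketch (assuming tokio and the futures crate; `process` is a hypothetical stand-in for a real I/O call) - the first version expresses no concurrency at all, the second puts all N operations in flight at once:

```rust
use futures::future::join_all;
use std::time::Duration;

// Hypothetical stand-in for an I/O-bound operation.
async fn process(x: u32) -> u32 {
    tokio::time::sleep(Duration::from_millis(50)).await;
    x * 2
}

// "for each operation x, await process(x)": total time ~ N * 50ms.
async fn sequential(xs: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(xs.len());
    for &x in xs {
        out.push(process(x).await); // nothing else is in flight while this waits
    }
    out
}

// "here are N I/O operations, run them all concurrently": total ~ 50ms.
async fn concurrent(xs: &[u32]) -> Vec<u32> {
    join_all(xs.iter().map(|&x| process(x))).await
}

#[tokio::main]
async fn main() {
    let xs = [1, 2, 3, 4];
    assert_eq!(sequential(&xs).await, concurrent(&xs).await);
}
```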

This is not a new lesson - work stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s it’s why Apple developed GCD. Threads simply don’t give the kernel scheduler any richer information about the workload, and kernel threads are an insanely heavy mechanism for achieving fine-grained concurrency - even worse when that concurrency is I/O or a mixed workload rather than pure compute that’s embarrassingly easy to parallelize.

Do all programs need this level of performance? No, probably not. But it becomes significantly easier to hit a higher performance bar, and in practice to reach a latency and throughput level that traditional approaches can’t match with the same level of effort.

You can tell async is directionally kind of correct in that io_uring is the kernel’s approach to high-performance I/O, it looks nothing like traditional threading and syscalls, and its completion model looks a lot closer to async concurrency (although, granted, exploiting it fully is much harder in an async world because async/await is an insufficient number of colors to express how async tasks interrelate).

> work stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s

Well, we've known how to make "traditional threads" fast, with lower latency and more consistent P99, since forever^2 - the early 90s. [1]

Sure, we can't convince that Finnish guy this is worthwhile to include in THE kernel, despite similar ideas having been running in Google datacenters for idk how many years - 15+? But nothing stops us from doing it in userspace, just as you said: a work stealing executor. And no, no coloring.

Stack is all you need. Just make your "coroutines" stackful. Done. All those attempts to be "zero-cost" that dramatically change the programming model to avoid a stack introduce much more overhead than a stack plus a piece of decent context-switch code.

> You can tell async is directionally kind of correct in that io_uring is the kernel’s approach

lol, it is very hard to model anything proactor-like (e.g. io_uring) with async Rust due to its defects.

[1] Anderson et al., "Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism" - https://dl.acm.org/doi/10.1145/121132.121151

Stackful coroutines give up a fair amount of efficiency in a number of places to make that workable. That’s fine if you’re willing to use a lot more RAM (Go and Java make that tradeoff), but it’s not suitable for something like Rust; that’s why Rust’s and C++’s async implementations are rather similar in many ways. Stackful coroutines also play havoc with FFI, imposing a huge cost penalty across the board even for code that doesn’t care about coroutines. These aren’t theoretical tradeoffs to just hand-wave away as “doesn’t matter” - it literally does matter for how Rust is positioned. No one is stopping you from using Go or the JVM if that’s the ecosystem you like better.

> lol, it is very hard to model anything proactor-like (e.g. io_uring) with async Rust due to its defects.

Not really. People latched on to async cancellation issues as intractable because of one paper, but I’m not convinced it’s unsolvable - whether via runtimes that design for the issue more fundamentally, or via the language adding async drop, which would let the existing runtimes solve the problem wholesale.

The point I’m making is that I/O and hardware are fundamentally non-blocking, and we will always pay a huge abstraction penalty trying to pretend we have a synchronous programming model.
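For what it’s worth, you can see the shape of a runtime that “considers the issue more fundamentally” in the tokio-uring crate (a sketch based on its documented API): the buffer is passed into the operation by ownership and handed back on completion, so a dropped future can’t pull memory out from under an in-flight kernel operation.

```rust
use tokio_uring::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // io_uring is completion-based: we submit an operation and are
    // notified when it finishes, rather than polling for readiness.
    tokio_uring::start(async {
        let file = File::open("hello.txt").await?;

        // The buffer is moved into the operation; the kernel owns it
        // while the read is in flight, and we get it back on completion.
        let buf = vec![0u8; 4096];
        let (res, buf) = file.read_at(buf, 0).await;
        let n = res?;

        println!("read {} bytes: {:?}", n, &buf[..n]);
        Ok(())
    })
}
```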

There are always trade-offs and there is never one best way to do something.

Stack-based coroutines are one way to do it. The relevant trade-offs here are overhead and requiring a runtime, which narrows the potential use cases this can serve (e.g. it rules out embedded real-time stuff).

If you don’t care about supporting such use cases, you can of course just recreate goroutines and be pretty happy with the result.

>despite similar ideas had been running in Google datacenters for idk how many years

I guess this is referring to https://www.youtube.com/watch?v=KXuZi9aeGTw ?

[deleted]

I am not saying threads are the model for all programming problems. For example, a dependency graph like an Excel spreadsheet can be analyzed and parallelized.

But as you observed, async/await fails to express concurrency any better. It’s also a thread; it’s just a worse implementation.

That’s incorrect. Even when expressed suboptimally, it still tends to result in overall higher throughput and consistently lower latency (work stealing executors specifically). And when you’re in this world, you can always do an optimization pass to better express the concurrency. If you’ve not written it async to start with, then you’re boned and have no easy escape hatch to optimize with.

Why can’t you do the same optimization? Are you maxing out your OS resources on thread overhead?

That’s part of it. Then you add a thread pool to dispatch your tasks into, to mitigate the cost of a thread start. Then you run into blocking problems and think “I wish I had some keyword to express when a function needs to run on the thread pool”. Then you’ve done a speed run of the past 40 years of research.
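A hand-rolled version of that first step, as a sketch (std-only Rust; real code would reach for a crate or a runtime’s spawn_blocking):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

fn main() {
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));

    // A fixed pool of workers amortizes thread-start cost...
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // before the job runs.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok(job) => job(), // ...but a blocking job stalls 1 of 4 workers
                    Err(_) => break,  // channel closed: shut down
                }
            })
        })
        .collect();

    for i in 0..16 {
        tx.send(Box::new(move || println!("task {i}"))).unwrap();
    }
    drop(tx); // close the channel so workers exit
    for w in workers {
        w.join().unwrap();
    }
}
```

The “blocking problem” is visible in the worker loop: nothing in the type system tells you which Jobs will park a worker for seconds, which is the gap the async keyword ends up filling.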

The 40 years of research was actually in OS theory, so that you could write normal code and the async parts were abstracted away.

A thread pool is not a research project.

Although they can be used in similar ways, they work very differently.

* Cooperative vs. preemptive scheduling

* Userspace vs. kernel scheduling

* Stackless vs. stackful

* Easy control over waiting/blocking behavior vs. none

* Easy fan out + join vs. maybe, with some work and thread spawn overhead

* Can integrate within a single-threaded event loop vs. not really

Depending on what you're doing they may be interchangeable, or you can only go one way. The basic case where you're doing essentially synchronous work in a thread/task is no different either way, other than efficiency and having colored functions with async/await. If you're doing UI work, event handlers likely run in a single-threaded event loop - the only thread you can touch the UI from - which you can't block or the UI is going to freeze.
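On that last list item, here's a small sketch of what integrating with a single-threaded event loop looks like in async land (using tokio's current-thread runtime and LocalSet; the names and strings are made up): tasks can share non-Send state like Rc because everything runs on one thread, and by the same token a single blocking call freezes every task on the loop.

```rust
use std::rc::Rc;
use tokio::task::LocalSet;

fn main() {
    // A single-threaded runtime: all tasks run on this one thread.
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let local = LocalSet::new();
    local.block_on(&rt, async {
        // Rc is !Send: fine here, since no task ever moves threads.
        let ui_state = Rc::new(String::from("window title"));

        let state = Rc::clone(&ui_state);
        let handle = tokio::task::spawn_local(async move {
            println!("event handler sees: {state}");
            // A std::thread::sleep here would freeze every task on the
            // loop - the async analogue of blocking a UI thread.
        });
        handle.await.unwrap();
    });
}
```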

OS thread overhead can be pretty substantial. Starting new threads on Windows is especially expensive.
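Easy to check on your own machine; a crude sketch (the absolute numbers vary wildly by OS, which is the point):

```rust
use std::thread;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    for _ in 0..1_000 {
        // Spawn a do-nothing OS thread and wait for it to exit.
        thread::spawn(|| {}).join().unwrap();
    }
    // Compare this figure across OSes; Windows tends to be the outlier.
    println!("1000 spawn+join cycles: {:?}", start.elapsed());
}
```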

> threads are inherently and inescapably too heavyweight to express concurrency in an efficient way

Your premise is wrong. There are many counterexamples to this.

Can you explain more? I've always heard this.

The most prominent example is probably Go with its goroutines, but there are many more. You can easily spawn tens of thousands of goroutines with low overhead and great performance.
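The same experiment ports directly to any lightweight-task runtime; here is the shape of it in Rust with tokio (a sketch standing in for the Go original):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Tens of thousands of userspace tasks, multiplexed by the runtime
    // over a small pool of OS threads - the same trick Go's scheduler
    // plays with goroutines.
    let handles: Vec<_> = (0..50_000)
        .map(|i| {
            tokio::spawn(async move {
                tokio::time::sleep(Duration::from_millis(10)).await;
                i as u64
            })
        })
        .collect();

    let mut sum = 0u64;
    for h in handles {
        sum += h.await.unwrap();
    }
    println!("sum of 50k task results: {sum}");
}
```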

Goroutines/"fibers"/"green threads" are usually scheduled by the runtime system across a small pool of actual OS threads.

The word "thread" is confusing things. In computer science a thread represents a flow of execution, which in concrete terms where execution is a series of function calls, is typically a program counter and a stack.

There are many ways to implement and manage threads. In Unix-like and Windows systems a "thread" is the above, plus a bunch of kernel context, plus implicit preemptive context switching. Because Unix and Windows added threads to their architectures relatively late in their development, each thread has to behave sort of like its own process, capable of running all the pre-existing software that was thread-agnostic. Which is why they have implicit scheduling, large userspace stacks, etc.

But nothing about "thread" requires it to be implemented or behave exactly like "OS threads" do in popular operating systems. People wax on about Async Rust and state machines. Well, a thread is already a state machine, too. Async Rust has to nest a bunch of state machine contexts along with space for the data manipulated in each function - that's called a stack. So Async Rust is one layer of threading built atop another layer of threading. And it did this not because it's better, but primarily because of legacy FFI concerns and interoperability with non-Rust software that depends on the pre-existing ABIs for stack and scheduling management.
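To ground the state machine point: an async fn compiles to roughly the following kind of enum, one variant per await point, with the live locals stored inline. This is a simplified hand-written sketch of the idea, not actual compiler output:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Rough equivalent of what the compiler generates for:
//     async fn step(fetch: F) -> u32 { let a = fetch.await; a + 1 }
// Each await point becomes a variant; the live locals are stored in the
// enum, which is exactly the role a stack frame plays for a thread.
enum Step<F> {
    AwaitingFetch(F),
    Done,
}

impl<F: Future<Output = u32> + Unpin> Future for Step<F> {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        match &mut *self {
            Step::AwaitingFetch(fetch) => match Pin::new(fetch).poll(cx) {
                Poll::Ready(a) => {
                    *self = Step::Done;
                    Poll::Ready(a + 1) // resume "after the await point"
                }
                Poll::Pending => Poll::Pending, // park until the waker fires
            },
            Step::Done => panic!("polled after completion"),
        }
    }
}

fn main() {
    // Drive it with an already-ready future: 41 + 1 = 42.
    let fut = Step::AwaitingFetch(std::future::ready(41u32));
    assert_eq!(futures::executor::block_on(fut), 42);
}
```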

Go largely went in the opposite direction, embracing threads as a first-class concept in a way that makes it no less scalable or cheap than Rust Futures, notwithstanding that Go, too, had to deal with legacy OS APIs and semantics, which they abstracted and modeled with their G (goroutine), M (machine), P (processor) architecture.

I thought it was obvious from context: OS threads are too heavyweight for fine-grained concurrency.

Go uses userspace threads. It’s also interesting that Go and Java are the only mainstream languages to have gone this route. The reason is that green threads carry a huge penalty when calling into FFI code that doesn’t use them, whereas this cost isn’t there for async/await.

Also, you have to rewrite the entire standard library, because the kernel knows how to suspend kernel threads at syscalls, but not green threads. (Go and Java already had to do this anyway, of course.)