> work stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s

Well, we know how to make "traditional threads" fast, with lower latency and more consistent P99 since forever^2, in the early 90s. [1]

Sure, we can't convince that Finnish guy this is worthwhile to include in THE kernel, despite similar ideas had been running in Google datacenters for idk how many years, 15 years+? But nothing stops us from doing it in the userspace, just as you said, a work stealing executor. And no, no coloring.

Stack is all you need. Just make your "coroutines" stackful. Done. All those attempts trying to be "zero-cost" and change programming model dramatically to avoid a stack, introduced much more overhead than a stack and a piece of decent context switch code.

> You can tell async is directionally kind of correct in that io_uring is the kernel’s approach

lol, it is very hard to model anything proactor like io_uring with async Rust due to its defects.

[1] https://dl.acm.org/doi/10.1145/121132.121151

Stackful coroutines give up a fair amount of efficiency in a number of places to make that workable. It’s fine if you want to use a lot more RAM and Go and Java make that tradeoff, but that’s not suitable for something like Rust. That’s why Rust and C++’s async implementation is rather similar in many ways. Stackful coroutines also play havoc with FFI which carries a huge FFI cost penalty across the board even for code that doesn’t care about coroutines. These aren’t theoretical tradeoffs to just hand wave away as “doesn’t matter” - it literally does for how Rust is positioned. No one is stopping you from using Go or the JVM if that’s the ecosystem you like better.

> lol, it is very hard to model anything proactor like io_uring with async Rust due to its defects.

Not really. People latched on to async cancellation issues as intractable due to one paper but I’m not convinced it’s unsolvable whether due to runtimes that consider the issue more fundamentally or the language adding async drop which lets the existing runtimes solve the problem wholesale.

The point I’m making is that I/O and hardware is fundamentally non-blocking and we will always pay a huge abstraction penalty to try to pretend we have a synchronous programming model.

There are always trade-offs and there is never one best way to do something.

Stack-based coroutines is one way to do it. A relevant trade-off here is overhead, requiring a runtime and narrowing the potential use-cases this can serve (i.e embedded real-time stuff).

If you don’t care about supporting such use cases you can of course just create a copy of goroutines and be pretty happy with the result.

>despite similar ideas had been running in Google datacenters for idk how many years

I guess this is referring to https://www.youtube.com/watch?v=KXuZi9aeGTw ?

[deleted]