Stackful coroutines make sense when you have the RAM for it.
I've been using Zig for embedded (ARM Cortex-M4, 256KB RAM) mainly for memory safety with C interop. The explicitness around calling conventions catches ABI mismatches at compile time instead of as runtime crashes.
I actually prefer colored async (like Rust) over this approach. The "illusion of synchronous code" feels magical, but magic becomes a gotcha in larger codebases when you can't tell what's blocking and what isn't.
All synchronous code is an illusion created in software, as is the very notion of "blocking". The CPU doesn't block for IO. An OS thread is a (scheduled) "stackful coroutine" implemented in the OS that gives the illusion of blocking where there is none.
The only problem is that the OS implements that illusion in a way that's rather costly, allowing only a relatively small number of threads (typically, you have no more than a few thousand frequently-active OS threads), while languages, which know more about how they use the stack, can offer the same illusion in a way that scales to a higher number of concurrent operations. But there's really no more magic in how a language implements this than in how the OS implements it, and no more illusion. They are both similar implementations of the same illusion. "Blocking" is always a software abstraction over machine operations that don't actually block.
The only question is how important it is for software to distinguish between the OS's implementation of this shared abstraction and the language's.
Unfortunately, the illusion of an OS thread relies on keeping a single consistent stack. Stackful coroutines (implemented on top of kernel threads) break this model in a way that has many detrimental effects; stackless ones do not.
It is true that in some languages there could be difficulties due to idiosyncrasies of the language's implementation, but it's not an intrinsic difficulty. We've implemented virtual threads in the JVM, and we've used the same Thread API with no issue.
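For reference, this is roughly what "the same Thread API" means in practice (a minimal sketch, requires JDK 21+): the only difference from a platform thread is the builder you start from.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();

        // Same java.lang.Thread API; only the builder differs.
        Thread platform = Thread.ofPlatform().start(counter::incrementAndGet);
        Thread virtual  = Thread.ofVirtual().start(counter::incrementAndGet);

        platform.join();
        virtual.join();

        // Blocking calls (join, sleep, socket reads) look identical in both
        // cases; on a virtual thread the JVM parks the virtual thread and
        // frees the carrier OS thread instead of blocking it.
        System.out.println(counter.get()); // prints 2
    }
}
```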
Yep, the JVM structured concurrency implementation is amazing. One thing I started wondering, though, especially when reading this post on HN, is whether stackless coroutines could (have) fit the JVM in some way to get even better performance for those who may care.
They wouldn't have had better performance, though. There is no significant performance penalty we're paying, although there's a nuance here that may be worth pointing out.
There are two different use cases for coroutines that may tempt implementors to address both with a single implementation, but they are sufficiently different to warrant two separate implementations. One is the generator use case. What makes it special is that there are exactly two communicating parties, and both of their state may fit in the CPU cache. The other use case is general concurrency, primarily for IO. In that situation, a scheduler juggles a large number of user-mode threads, and because of that, there is likely a cache miss on every context switch, no matter how efficient it is. However, in the second case, almost all of the performance is due to Little's law rather than context switch time (see my explanation here: https://inside.java/2020/08/07/loom-performance/).
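To make the Little's-law point concrete (the numbers here are made up for illustration): if each request spends 10 ms mostly waiting on I/O and you want to serve 100,000 requests per second, you need about 1,000 requests in flight concurrently, and that requirement dwarfs any per-switch overhead.

```java
public class LittlesLaw {
    public static void main(String[] args) {
        // Little's law: L = lambda * W
        //   L      = mean number of requests in flight (required concurrency)
        //   lambda = throughput (requests per second)
        //   W      = mean time each request spends in the system (seconds)
        double throughput = 100_000; // hypothetical target, requests/s
        double latency = 0.010;      // 10 ms per request, mostly I/O wait

        double concurrency = throughput * latency;
        System.out.println((long) concurrency); // prints 1000

        // Context-switch cost barely matters at this scale: even at 1 us per
        // switch, switching adds roughly 0.1 ms to a 10 ms request.
    }
}
```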
That means that a "stackful" implementation of user-mode threads can have no significant performance penalty for the second use case (which, BTW, I think has much more value than the first), even though a more performant implementation is possible for the first use case. In Java we decided to tackle the second use case with virtual threads, and so far we've not offered something for the first (for which the demand is significantly lower).
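(For what it's worth, the generator use case can be approximated today on top of virtual threads. This is a hypothetical sketch, not an official API, and it pays exactly the per-switch cost described above.)

```java
import java.util.concurrent.SynchronousQueue;

public class SquaresGenerator {
    public static void main(String[] args) throws InterruptedException {
        // A generator built on a virtual thread: the producer "yields" by
        // handing each value to a rendezvous queue, which parks it until
        // the consumer asks for the next value.
        SynchronousQueue<Integer> channel = new SynchronousQueue<>();

        Thread.ofVirtual().start(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    channel.put(i * i); // suspends until the consumer takes it
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        for (int i = 0; i < 3; i++) {
            System.out.println(channel.take()); // prints 1, then 4, then 9
        }
    }
}
```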
What happens in languages that choose to tackle both use cases with the same construct is that they gain negligible performance in the second use case (at best), but they pay for that negligible benefit with a substantial degradation in user experience. That's just a bad tradeoff, but some languages (especially low-level ones) may have little choice, because, lacking Java's very efficient heap memory management, their stackful solution does carry a significant performance cost.
The OS allocates your thread stack in much the same way that a coroutine runtime allocates the coroutine stack. The OS will swap the stack pointer and a bunch more things in each context switch, the coroutine runtime will also swap the stack pointer and some other things. It's really the same thing. The only difference is that the runtime in a compiled language knows more about your code than the OS does, so it can make assumptions that the OS can't, and that's what makes user-space coroutines lighter. The mechanisms are the same.
And the stackless runtime will use some other register than the stack pointer to access the coroutine's activation frame, leaving the stack pointer register free for OS and library use, and avoiding the many drawbacks of fiddling with the system stack as stackful coroutines do. It's the same thing.
The new Zig IO will essentially be colored, but in a nicer way than Rust.
You don't have to color your function based on whether you're supposed to use it in an async or sync manner. But it will essentially be colored based on whether it does I/O or not (the function takes an IO interface as an argument). Which is actually important information to "color" a function with.
Whether you're doing async or sync I/O will be colored at the place where you call an IO function. Which IMO is the correct way to do it. If you call with "async" it's nonblocking, if you call without it, it's blocking. Very explicit, but not in a way that forces you to write a blocking and async version of all IO functions.
The Zio readme says it will be an implementation of the Zig IO interface when it's released.
I guess you can then choose if you want explicit async (use Zig stdlib IO functions) or implicit async (Zio), and I suppose you can mix them.
> Stackful coroutines make sense when you have the RAM for it.
So I've been thinking a bit about this. Why should stackful coroutines require more RAM? Partly because when you set up the coroutine you don't know how big the stack needs to be, right? So you need to use a safe upper bound, while stackless coroutines only set up the memory you need to suspend the coroutine. But Zig has a goal of having a built-in to calculate the required stack size for calling a function. Something it should be able to do (when you don't have recursion and don't call external C code), since Zig compiles everything in one compilation unit.
Zig devs are working on stackless coroutines as well. But I wonder if some of the benefits go away if you can allocate exactly the amount of stack a stackful coroutine needs to run and nothing more.
This is not true. Imagine a function that takes an `io` parameter and performs a read on it.
Can you guess if it's going to do blocking or non-blocking I/O? The io parameter is not really "coloring", as defined by the async/await debate, because you can have code that is completely unaware of any async I/O, pass it std.Io.Reader, and it will just work; blocking or non-blocking, it makes no difference. Heck, you can even wrap this into C callbacks and use something like hiredis with async I/O.
Stackful coroutines need more memory, because you need to pre-allocate a large enough stack for the entire lifetime. With stackless coroutines, you only need the current state, but with the disadvantage that you need frequent allocations.
> Stackful coroutines need more memory, because you need to pre-allocate a large enough stack for the entire lifetime. With stackless coroutines, you only need the current state, but with the disadvantage that you need frequent allocations.
This is not quite correct -- a stackful coroutine can start with a small stack and grow it dynamically, whereas stackless coroutines allocate the entire state machine up front.
The reason why stackful coroutines typically use more memory is that the task's stack must be large enough to hold both persistent state (like local variables that are needed across await points) and ephemeral state (like local variables that don't live across await points, and stack frames of leaf functions that never suspend). With a stackless implementation, the per-task storage only holds persistent state, and the OS thread's stack is available as scratch space for the current task's ephemeral state.
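A small illustration of that split (the names and the use of a blocking call as the stand-in suspension point are hypothetical, not from the original comment):

```java
public class StackStateDemo {
    // Leaf function that never suspends: its whole frame is ephemeral and
    // can live on the carrier thread's ordinary stack.
    static int checksum(byte[] data) {
        int sum = 0;
        for (byte b : data) sum += b;
        return sum;
    }

    static void handle(byte[] request) throws InterruptedException {
        int id = checksum(request);       // used after the suspension point:
                                          // persistent, must be saved per task
        int scratch = request.length * 2; // used only before the suspension:
        System.out.println(scratch);      // ephemeral, dead by the time we park

        Thread.sleep(1); // suspension point: a virtual thread parks here

        System.out.println(id); // persistent state consumed after resuming
    }

    public static void main(String[] args) throws Exception {
        Thread t = Thread.ofVirtual().start(() -> {
            try {
                handle(new byte[]{1, 2, 3, 4});
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.join();
    }
}
```

A stackless implementation would store only `id` in the per-task frame; `scratch` and `checksum`'s frame would use the shared OS-thread stack as scratch space.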
> You don't have to color your function based on whether you're supposed to use it in an async or sync manner. But it will essentially be colored based on whether it does I/O or not (the function takes an IO interface as an argument). Which is actually important information to "color" a function with.
The Rust folks are working on a general effect system, including potentially an 'IO' effect. Being able to abstract out the difference between 'sync' and 'async' code is a key motivation of this.
> when you can't tell what's blocking and what isn't.
Isn't that exactly why they're making IO explicit in functions? So you can trace it up the call chain.