I am still mystified as to why callback-based async seems to have become the standard. What this and e.g. libtask[1] do seems so much cleaner to me.
The Rust folks adopted async with callbacks, and they were essentially starting from scratch so had no need to do it that way, and they are smarter than I (both individually and collectively) so I'm sure they have a reason; I just don't know what it is.
Stackless coroutines can be implemented using high-level language constructs, entirely in your language. Because of this, they interact with legacy code and existing language features in predictable ways. With stackful coroutines, on the other hand, some security software and code-hardening or instrumentation libraries will break as well.
Also, async at low level is literally always callbacks (even processor interrupts are callbacks)
By mucking about with the stack, you break things like stack unwinding for exceptions and GC, you confuse debuggers, and you probably make a bunch of assumptions you shouldn't.
If you start using the compiler backend in unexpected ways, you either expose bugs or run into missing functionality, and you find that the compiler writers made assumptions about the code (rightly or not) that break when you start wildly overwriting parts of the stack.
Writing a compiler frontend is hard enough as it is, and becoming an LLVM expert is generally too much for most people.
But even if you manage to get it working, should your code break something in the compiler or in any of the widely used external tooling, you can't fast-track the fix, and thus you can't release your language (it now depends on a broken external dependency, with the fix arriving whenever the maintainers feel like it).
I guess even if you are some sort of superhero who can do all this correctly, the LLVM people won't be happy merging a low-level codegen change that could potentially break all compiled software of trillion-dollar corporations for the benefit of some small internet project.
The research Microsoft engineers did on stackful vs. stackless coroutines for the C++ standard, I think, swayed stackless as "the way" to implement it for something targeting the systems level: significantly less memory overhead (you only pay for what you use), and the implementation details of the executor are offloaded (lots of different design choices can be made there).
Yup, stackful fibers are an anti-pattern. Here's Gor Nishanov's review for the C++ ISO committee https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p13... linked from https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=10... . Notice how it sums things up:
> DO NOT USE FIBERS!
And this is the rebuttal: https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2019/p08...
There are downsides to stackful coroutines (peak stack usage, for example), but I feel that P1364 was attacking a strawman: first, it compares a solution with built-in compiler support against a pure library implementation; second, it doesn't even compare against the reference implementation of the competing proposal.
The TL;DR of that sums up my opinions pretty well.
As an aside, I know Rust would be unlikely to implement segmented stacks for fibers, given that they were burned by the performance implications thereof previously.
> DO NOT USE FIBERS!
For C++.
If your language has RAII or exceptions, it raises crazy questions. Say thread A is hosting fiber 1, which throws an exception that propagates outside the fiber invocation scope and destroys a bunch of objects; then we switch to fiber 2, which sees the world in an inconsistent state (outside resources have been cleaned up, inside ones are still alive).
This was literally impossible in pre-fiber code, so most existing code would probably not handle it well.
That's no different from threads running concurrent exceptions (in fact it's simpler in the single-threaded case). RAII and exceptions are really not an issue for stackful coroutines.
Are stackful fibers the same as stackful coroutines?
yes same thing, different names.
Many of these issues go away if you control the compiler and runtime, which Rust does (and they needed to make changes to those to add async, so changes were inevitable).
> significantly less memory overhead
On an OS with overcommit, you might also only pay for what you use (at a page granularity), but this may be defeated if the stack gets cleared (or initialized to a canary value) by the runtime.
Not just cleared. Each stack would get dirtied to its largest depth, and that memory never gets reclaimed even if you only hit that stack depth once. And the OS's only way to reclaim it is if the allocator frees the page (in practice rarely, particularly for glibc) or it gets swapped to disk (expensive, since unused pages end up getting swapped too).
> Each stack would get dirtied to the largest depth and that memory never gets reclaimed even if you only hit that deep of stack once
Right, but the issue under discussion was that "right sizing" stacks is hard. You always need to have a stack sized to at least the largest depth.
Pages that get swapped out to disk and never accessed again aren't that expensive. Disk is cheap.
One thing I would consider "unclean" about the zio approach (and e.g. libtask) is that you pass it an arbitrary expected stack size (or, as in the example, assume the default) and practically just kind of hope it's big enough not to blow up and small enough to be able to spawn as many tasks as you need. Meanwhile, how much stack actually ends up being needed by the function is a platform specific implementation detail and hard to know.
This is a gotcha of using stack allocation in general, but it's exacerbated here by the fact that you have an incentive to keep the stacks as small as possible when you want many concurrent tasks. So you either solve the puzzle of exactly how big the stack needs to be, or you undershoot and overflow with possibly disastrous effects (especially if your stack happens to overflow into memory that doesn't cause an access violation), or you overshoot and waste memory. Better yet, you may have calculated and optimized your stack size for your platform, and then the code ends up doing UB on a different platform with fewer registers, bigger `c_long`s or different alignment constraints.
If something like https://github.com/ziglang/zig/issues/157 actually gets implemented I will be happier about this approach.
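For concreteness, this is roughly what the spawn call looks like with libtask (API quoted from memory, so check task.h for the exact signature); the stack size is just a number you hand to the runtime and hope is right:

    /* Sketch of spawning a libtask coroutine with a guessed stack size. */
    #include <task.h>

    static void worker(void *arg)
    {
        (void)arg;
        /* Any deep call chain in here (printf, DNS lookups, third-party code)
         * silently eats into the 8 KiB guessed below. */
    }

    /* libtask programs define taskmain() instead of main(). */
    void taskmain(int argc, char **argv)
    {
        (void)argc; (void)argv;
        /* 8192 bytes: maybe fine on the platform you tuned it for, maybe an
         * overflow on one with bigger frames or stricter alignment. */
        taskcreate(worker, NULL, 8192);
    }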
Maybe I've been on x64 Linux too long, but I would just specify 8MB of stack for each fiber and let overcommit handle the rest. For small fibers that would be 4k per fiber of RSS so a million fibers is 4GB of RAM which seems fine to me?
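Roughly what I mean, as a sketch (Linux-specific, and it assumes overcommit is on and nothing pre-touches or canary-fills the stack):

    #include <sys/mman.h>
    #include <stddef.h>

    #define STACK_RESERVE (8u * 1024 * 1024)   /* 8 MiB of address space per fiber */

    static void *alloc_fiber_stack(void)
    {
        /* MAP_NORESERVE: don't charge swap/commit for the whole 8 MiB up front;
         * pages only become resident when they are first written. */
        void *p = mmap(NULL, STACK_RESERVE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* A fiber that only ever touches ~4 KiB dirties a single page, so a million
     * fibers cost on the order of 1e6 * 4 KiB = 4 GiB of RSS (plus page tables),
     * despite ~8 TiB of reserved address space. Note that memset-ing each stack
     * to a canary value up front would defeat this by dirtying every page. */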
Couldn’t you use the Go approach of starting with a tiny stack that is big enough for 90% of cases, then grow it as needed?
Go depends on the fact that it can track all pointers, and when it needs to resize stacks, it can update them.
Previous versions of Go used segmented stacks, which are theoretically possible, if Zig really wanted (would need compiler support), but they have nasty performance side-effects, see https://www.youtube.com/watch?v=-K11rY57K7k
Resizing stacks on use does not depend on any of these properties of Go. You can do it like this in C, too. It does not require segmentation.
Resizing stacks, insofar as expansion may require moving the stack to some other place in memory that can support the new size, depends on these properties. Your initial 4k of coroutine stack may have been allocated in a place that won't fit the new 8k of coroutine stack.
Or are you making a point about virtual memory? If so, that assumption seems highly platform dependent.
You would implement this with virtual memory. Obviously, this is less of a limited resource on 64-bit systems. And I wouldn't recommend the Go/stack/libtask style model for high concurrency on any platform.
I'm very interested to know how. Do you mean reserving a huge chunk of virtual memory and slowly allocating it? That works to some degree, but it limits how many coroutines you can really spawn.
Yes, exactly.
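For concreteness, a sketch of that reserve-then-commit scheme on a POSIX-ish system; the struct and helper names are made up, and a real implementation would also want a guard page plus either a fault handler or explicit growth checks:

    #include <sys/mman.h>
    #include <stddef.h>
    #include <unistd.h>

    #define STACK_RESERVE (64u * 1024 * 1024)   /* address space only, not RAM */

    struct growable_stack {
        unsigned char *base;      /* lowest address of the reservation */
        size_t committed;         /* bytes currently usable at the top end */
    };

    /* Reserve the whole range with no access rights: nothing is committed yet,
     * and the region can never move, so pointers into the stack stay valid. */
    static int stack_reserve(struct growable_stack *s)
    {
        void *p = mmap(NULL, STACK_RESERVE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        s->base = p;
        s->committed = 0;
        return 0;
    }

    /* Stacks grow downward, so commit pages from the top of the reservation. */
    static int stack_grow(struct growable_stack *s, size_t want)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        want = (want + page - 1) & ~(page - 1);   /* round up to page size */
        if (want <= s->committed)
            return 0;
        if (want > STACK_RESERVE)
            return -1;                            /* ran out of the reservation */
        unsigned char *low = s->base + STACK_RESERVE - want;
        if (mprotect(low, want, PROT_READ | PROT_WRITE) != 0)
            return -1;
        s->committed = want;
        return 0;
    }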
Consider that resizing the stack may require reallocating it elsewhere in memory. This would invalidate any internal pointers to the stack.
AFAIK Go solves this by keeping track of these pointer locations and adjusting them when reallocating the stack. Aside from the run-time cost this incurs, it is unsuitable for Zig because Zig can't strictly know whether values represent pointers.
Go technically has this problem as well: if you, for example, convert a pointer to a uintptr, there is no guarantee that the former pointer will still be valid when converted back. But such conversions are rarely warranted and are made explicit via the `unsafe` package.
Zig is more like C in that it gives the programmer rather than a memory management runtime exclusive control and free rein over the memory. If there are some bits in memory that happen to have the same size as a pointer, Zig sees no reason to stop you from interpreting them as such. This is very powerful, but precludes abstractions like Go's run-time stack reallocation.
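A tiny C sketch of the hazard, with malloc'd blocks standing in for coroutine stacks; the explicit rebasing step in the middle is what Go's runtime does for every pointer it tracks, and exactly what C or Zig cannot do automatically:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stddef.h>

    int main(void)
    {
        size_t old_size = 64;
        char *stack = malloc(old_size);          /* stand-in for a coroutine stack */

        int *counter = (int *)(stack + 16);      /* a pointer to a "local" on it */
        *counter = 7;

        /* Growing the stack may require moving it somewhere else entirely. */
        char *moved = malloc(old_size * 2);
        memcpy(moved, stack, old_size);

        /* Go's runtime knows where every pointer into the stack lives and can
         * rebase them all during the move, like this: */
        ptrdiff_t off = (char *)counter - stack;
        counter = (int *)(moved + off);
        free(stack);

        /* Without that bookkeeping, any bits that happen to hold the old
         * address would be left pointing at freed memory. */
        printf("%d\n", *counter);                /* prints 7 from the moved stack */
        free(moved);
        return 0;
    }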
8kB is enough for 90% of use cases. But then you invoke getaddrinfo() once and now your stack is 128kB+.
The thread stack for something like libtask is ambiguously sized, and often really large relative to formalized async state.
> callback-based async seems to have become the standard
At some level it's always callbacks. Then people build frameworks on top of these so programmers can pretend they're not dealing with callbacks.
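For illustration, this is what the bottom layer looks like with the framework stripped away: a readiness loop built on poll(2) that calls back into user code. The struct and function names here are made up, not any particular library's API.

    #include <poll.h>
    #include <stddef.h>

    typedef void (*ready_cb)(int fd, void *ctx);

    struct event_source {
        int fd;
        ready_cb on_readable;   /* invoked when fd becomes readable */
        void *ctx;              /* user data handed back to the callback */
    };

    /* One turn of the loop over up to 64 sources: wait for readiness,
     * then dispatch to the registered callbacks. */
    static void loop_once(struct event_source *srcs, size_t n)
    {
        struct pollfd pfds[64];
        if (n > 64)
            n = 64;
        for (size_t i = 0; i < n; i++) {
            pfds[i].fd = srcs[i].fd;
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
        }
        if (poll(pfds, (nfds_t)n, -1) > 0) {
            for (size_t i = 0; i < n; i++)
                if (pfds[i].revents & POLLIN)
                    srcs[i].on_readable(srcs[i].fd, srcs[i].ctx);
        }
    }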
> The Rust folks adopted async with callbacks
Rust's async is not based on callbacks, it's based on polling. So really there are three ways to implement async:
- The callback approach used by e.g. Node.js and Swift, where a function that may suspend accepts a callback as an argument, and invokes the callback once it is ready to make progress. The compiler transforms async/await code into continuation-passing style.
- The stackful approach used by e.g. Go, libtask, and this; where a runtime switches between green threads when a task is ready to make progress. Simple and easy to implement, but introduces complexity around stack size.
- Rust's polling approach: an async task is statically transformed into a state machine object that is polled by a runtime when it's ready to make progress.
Each approach has its advantages and disadvantages. Continuation-passing style doesn't require a runtime to manage tasks, but each call site must capture local variables into a closure, which tends to require a lot of heap allocation and copying (you could also use Rust's generic closures, but that would massively bloat code size and compile times because every suspending function must be specialized for each call site). So it's not really acceptable for applications looking for maximum performance and control over allocations.
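To make the allocation cost concrete, here is roughly what "capturing locals into a closure" amounts to, written out by hand in C. read_file_async is a made-up API and the whole thing is a sketch, not any real runtime:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef void (*done_cb)(const char *data, void *ctx);

    /* Synchronous stand-in; a real runtime would invoke cb later, from its
     * event loop. */
    static void read_file_async(const char *path, done_cb cb, void *ctx)
    {
        (void)path;
        cb("(file contents)", ctx);
    }

    struct step2_ctx {            /* the hand-rolled closure */
        int retries;              /* locals that must outlive the suspension... */
        char *path_copy;          /* ...so they get heap-allocated and copied */
    };

    static void on_file_read(const char *data, void *ctx)   /* the continuation */
    {
        struct step2_ctx *c = ctx;
        printf("read %s after %d retries: %s\n", c->path_copy, c->retries, data);
        free(c->path_copy);
        free(c);                  /* free the closure by hand */
    }

    static void start(const char *path)
    {
        struct step2_ctx *c = malloc(sizeof *c);  /* allocate the closure */
        c->retries = 0;
        c->path_copy = strdup(path);
        read_file_async(path, on_file_read, c);   /* "suspend": return to caller */
    }

    int main(void)
    {
        start("example.txt");
        return 0;
    }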
Stackful coroutines require managing stacks. Allocating large stacks is very expensive in terms of performance and memory usage; it won't scale to thousands or millions of tasks and largely negates the benefits of green threading. Allocating small stacks means you need the ability to dynamically resize stacks at runtime, which requires dynamic allocation and adds significant performance and complexity overhead if you want to make an FFI call from an asynchronous task (in Go, every function begins with a prologue to check if there is enough stack space and allocate more if needed; since foreign functions do not have this prologue, an FFI call requires switching to a sufficiently large stack). This project uses fixed-size task stacks, customizable per-task but defaulting to 256K [1]. This default is several orders of magnitude larger than a typical task size in other green-threading runtimes, so to achieve large scale the programmer must manually manage the stack size on a per-task basis, and face stack overflows if they guess wrong (potentially only in rare/edge cases).
Rust's "stackless" polling-based approach means the compiler knows statically exactly how much persistent storage a suspended task needs, so the application or runtime can allocate this storage up-front and never need to resize it; while a running task has a full OS thread stack available as scratch space and for FFI. It doesn't require dynamic memory allocation, but it imposes limits on things like recursion. Rust initially had stackful coroutines, but this was dropped in order to not require dynamic allocation and remove the FFI overhead.
The async support in Zig's standard library, once it's complete, is supposed to let the application developer choose between stackful and stackless coroutines depending on the needs of the application.
[1]: https://github.com/lalinsky/zio/blob/9e2153eed99a772225de9b2...
See also: https://man.9front.org/2/thread
The history of this concurrency model is here: https://seh.dev/go-legacy/
I think it started with an interrupt. And less abstraction often wins.
This is the only explanation here I can intuitively understand!