You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly. It's a matter of saving a few registers and switching the stack pointer; minicoro [1] is a pretty good C library that does it. I like this model a lot more than C++20 coroutines:
1. C++20 coros are stackless, so in the general case every async "function call" heap allocates.
2. If you do your own stackful coroutines, every function can suspend/resume, you don't have to deal with colored functions.
3. (opinion) C++20 coros are very tasteless and "C++-design-committee pilled". They're very hard to understand and implement, they require the STL, they're very heavy in debug builds, and you'll end up with template hell to do something as simple as Promise.all
> You can roll stackful coroutines in C++ (or C) with 50-ish lines of Assembly
I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C. And the obvious consequence is that it stops being portable. Minicoro only supports three architectures. Granted, those are the three most popular ones, but other architectures exist.
(just double checked and it doesn't do Windows/ARM, for example. Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon, but they have at least some of it)
> Not that I'm expecting Microsoft to ship full conformance for C++23 any time soon,
They are actively working on it for their VS2026 C++ compiler. I think since 2017 or so they've kept up with C++ standards reasonably? I'm not a heavy C++ guy, so maybe I'm wrong, but my understanding is they match the standards.
Boost has stackful coroutines. They also used to be in POSIX (makecontext).
> I'm not normally keen to "well actually" people with the C standard, but .. if you're writing in assembly, you're not writing in C.
These days on Linux/BSD/Solaris/macOS you can use makecontext()/swapcontext() from ucontext.h and it will turn out roughly the same performance on important architectures as what everyone used to do with custom assembly. And you already have fiber functions as part of the Windows API to trampoline.
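For readers who haven't used the ucontext API mentioned above, here is a minimal sketch of a coroutine that yields three times. It assumes Linux with glibc; the `ucdemo` namespace, the 64 KiB stack size, and the `trace` vector are illustrative choices, not anything taken from libdex.

```cpp
#include <ucontext.h>
#include <vector>

namespace ucdemo {

ucontext_t main_ctx;      // where the caller is parked while the coroutine runs
ucontext_t co_ctx;        // where the coroutine is parked while the caller runs
std::vector<int> trace;   // records who ran, in order (0 = caller)
bool done = false;

// Runs on its own stack and hands control back after every step.
void worker() {
    for (int i = 1; i <= 3; ++i) {
        trace.push_back(i);
        swapcontext(&co_ctx, &main_ctx);  // yield to the caller
    }
    done = true;  // returning from here continues at uc_link (main_ctx)
}

std::vector<int> run() {
    alignas(16) static char stack[64 * 1024];  // assumed stack size
    trace.clear();
    done = false;

    getcontext(&co_ctx);                  // initialise before makecontext
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;           // where to go when worker returns
    makecontext(&co_ctx, worker, 0);

    while (!done) {
        trace.push_back(0);               // a step taken by the caller
        swapcontext(&main_ctx, &co_ctx);  // resume the coroutine
    }
    return trace;
}

}  // namespace ucdemo
```

Calling `ucdemo::run()` interleaves caller and coroutine steps. Note that `worker` is a plain function: any function it calls could do the `swapcontext` yield instead, which is the "no colored functions" property of the stackful model.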
I had to support a number of architectures in libdex for Debian. This is GNOME code of course, which isn't everyone's cup of C. (It also supports BSDs/Linux/macOS/Solaris/Windows).
* https://packages.debian.org/sid/libdex-1-1
* https://gitlab.gnome.org/GNOME/libdex
Unfortunately swapcontext requires saving and restoring the signal mask, which, at least on Linux, requires a syscall, so it is going to be at least a hundred times slower than a hand-rolled implementation.
Also, although not likely to be removed anytime soon from existing systems, POSIX has declared the context API obsolescent a while ago (it might actually no longer be part of the standard).
Stackful coroutines also can't be used to "send" a coroutine to a worker thread, because the compiler might save the address of a thread local variable across the thread switch (happened in QEMU).
Yes I know, GCC has a long standing bug open on the issue :(.
Signal mask? What century are we in?
It can be safely ignored for the vast majority of apps. If you're using multithreading (quite likely if you're doing coroutines), then signals are not a good fit anyway.
Aside from the fact that the signal mask is still relevant in 2026 and even for multithreaded programs, that doesn't have anything to do with the fact that POSIX requires swapcontext to preserve it.
In most cases you're already using signalfd in places where libdex runs.
Looking at the repo, it falls back to Windows fibers on Windows/ARM. If you'd like a coroutine with more backends, I'm a fan of libco: https://github.com/higan-emu/libco/ which has assembly backends for x86, amd64, ppc, ppc-64, arm, and arm64 (and falls back to setjmp on POSIX platforms and fibers on Windows). Obviously the real solution would be for the C or C++ committees to add stackful coroutines to the standard, but unless that happens I would rather give up support for hppa or alpha or 8-bit AVR or whatever than not be able to use stackful coroutines.
A proposal to add stackful coroutines has been around forever and gets updated at every single mailing. Unfortunately the authors don't really have backing from any major company.
There is no "Linux/ARM[64]". But there are "Raspberry Pi" and "RISC-V". I don't know such OSes, to be honest :-)
This support table is a complete mess. And saying "most platforms are supported" is too optimistic or even cocky.
I think what they meant is that that is what it takes to add coroutine support to a C/C++ program. Adding it to, say, Java or C# is much more involved.
Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly. But maybe not.
That's the problem with register machines, I guess. Interestingly enough, BCPL, its main implementation being a p-code interpreter of sorts, has pretty trivially supported coroutines in its "standard" library since the late seventies — as you say, all you need to save is the current stack pointer and the code pointer.
> Hmm. I'm fairly certain that most of that assembly code for saving/restoring registers can be replaced with setjmp/longjmp, and only control transfer itself would require actual assembly.
Actually you don't even need setjmp/longjmp. I've used a library (embedded environment) called protothreads (plain C) that abused the preprocessor to implement stackful coroutines.
(Defined a macro that used the __LINE__ macro coupled with another macro that used a switch statement to ensure that calling the function again made it resume from where the last YIELD macro was encountered)
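A rough sketch of the `__LINE__` plus `switch` trick described above. The macro names (`CO_BEGIN`, `CO_YIELD`, `CO_END`) and the `Task` struct are made up for illustration and are not the actual protothreads API. Anything that must survive a yield has to live in the task struct, since the function really does return each time.

```cpp
// State that must survive across yields lives here, not in locals.
struct Task {
    int line = 0;  // resume point: 0 = not started, -1 = finished
    int i = 0;     // loop counter kept across yields
};

// Open a switch on the saved resume point.
#define CO_BEGIN(t) switch ((t).line) { case 0:
// Remember the current line, return a value, and plant a case label so the
// next call jumps straight back to this spot.
#define CO_YIELD(t, v) \
    do { (t).line = __LINE__; return (v); case __LINE__:; } while (0)
// Close the switch and mark the task as finished.
#define CO_END(t) } (t).line = -1; return -1

// Yields 1, 2, 3 on successive calls, then -1 forever.
int count_to_three(Task& t) {
    CO_BEGIN(t);
    for (t.i = 1; t.i <= 3; ++t.i) {
        CO_YIELD(t, t.i);
    }
    CO_END(t);
}
```

Each `CO_YIELD` must sit on its own source line, otherwise two `case __LINE__` labels collide.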
Wouldn't that be stackless (shared stack)?
Correct; stackless. I misspoke.
You can do a lot of horrible things with setjmp and friends. I actually implemented some exception throw/catch macros using them (which did work) for a compiler that didn't support real C++ exceptions. Thank god we never used them in production code.
This would be about 32 years ago - I don't like thinking about that ...
GCC still uses sj/lj by default on some targets to implement exceptions.
setjmp + longjmp + sigaltstack is indeed the old trick.
C++ destructors and exception safety will likely wreak havoc with any "simple" assembly/longjmp-based solution, unless severely constraining what types you can use within the coroutines.
Not really. I've done it years ago. The one restriction for code inside the coroutine is that it mustn't catch (...). You solve destruction by distinguishing whether a coroutine is paused in the middle of execution or if it finished running. When the coroutine is about to be destructed you run it one last time and throw a special exception, triggering destruction of all RAII objects, which you catch at the coroutine entry point.
Passing uncaught exceptions from the coroutine up to the caller is also pretty easy, because it's all synchronous. You just need to wrap it so it can safely travel across the gap. You can restrict the exception types however you want. I chose to support only subclasses of std::exception and handle anything else as an unknown exception.
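Here is one way the cancel-by-exception scheme described above could look, sketched on top of ucontext (assuming Linux/glibc, where unwinding on a makecontext stack works in practice). The names `Coro`, `Cancelled`, and `Guard`, and the 64 KiB stack, are all hypothetical, and this is not the commenter's actual code.

```cpp
#include <ucontext.h>
#include <utility>
#include <vector>

namespace cancel_demo {

struct Cancelled {};  // hypothetical "unwind now" exception

class Coro;
Coro* g_current = nullptr;  // hands `this` to the entry trampoline

class Coro {
public:
    using Body = void (*)(Coro&);

    explicit Coro(Body body) : body_(body), stack_(64 * 1024) {
        getcontext(&self_);
        self_.uc_stack.ss_sp = stack_.data();
        self_.uc_stack.ss_size = stack_.size();
        self_.uc_link = &caller_;  // resume the caller when entry() returns
        makecontext(&self_, &Coro::entry, 0);
    }
    Coro(const Coro&) = delete;
    Coro& operator=(const Coro&) = delete;

    // If the coroutine is paused mid-execution, run it one last time with the
    // cancel flag set so that yield() throws and its RAII objects get destroyed.
    ~Coro() {
        if (started_ && !finished_) {
            cancel_ = true;
            resume();
        }
    }

    void resume() {
        if (finished_) return;
        started_ = true;
        g_current = this;
        swapcontext(&caller_, &self_);
    }

    // Called from inside the coroutine body.
    void yield() {
        swapcontext(&self_, &caller_);
        if (cancel_) throw Cancelled{};
    }

private:
    static void entry() {
        Coro* self = g_current;
        try {
            self->body_(*self);
        } catch (const Cancelled&) {
            // Swallowed at the entry point; the stack is unwound by now.
        }
        self->finished_ = true;
    }

    Body body_;
    std::vector<char> stack_;
    ucontext_t self_{};
    ucontext_t caller_{};
    bool started_ = false, finished_ = false, cancel_ = false;
};

int g_steps = 0, g_released = 0;
struct Guard { ~Guard() { ++g_released; } };  // stands in for any RAII object

void body(Coro& co) {
    Guard guard;
    for (;;) { ++g_steps; co.yield(); }
}

// Returns {steps executed, guards destroyed}.
std::pair<int, int> run() {
    g_steps = g_released = 0;
    {
        Coro co(body);
        co.resume();
        co.resume();  // now suspended in the middle of the loop
    }                 // ~Coro cancels and unwinds the coroutine stack
    return {g_steps, g_released};
}

}  // namespace cancel_demo
```

As the comment says, this breaks if the body does `catch (...)` without rethrowing, because the `Cancelled` exception would never reach `entry()`.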
> Passing uncaught exceptions from the coroutine up to the caller is also pretty easy, because it's all synchronous. You just need to wrap it so it can safely travel across the gap
This is also how dotnet handles it, and you can choose whether to rethrow at the caller site, inspect the exception manually, or run a continuation on exception.
> mustn't catch (...)
You could use the same trick used by glibc to implement unstoppable exceptions for POSIX cancellation: the exception rethrows itself from its destructor.
Thanks, that's interesting.
> every async "function call" heap allocates.
> require the STL
That it has to heap-allocate if non-inlined is a misconception. This is only the default behavior.
One can define:
void *operator new(size_t sz, Foo &foo)
in the coro's promise type, and this:
- removes the implicitly-defined operator new
- forces the coro's signature to be CoroType f(Foo &foo), and forwards arguments to the "operator new" one defined
Therefore, it's pretty trivial to support coroutines even when heap cannot be used, especially in the non-recursive case.
Yes, green threads ("stackful coroutines") are more straightforward to use, however:
- they can't be arbitrarily destroyed when suspended (this would require stack unwinding support and/or active support from the green thread runtime)
- they are very ABI dependent. Among the "few registers", one has to save FPU registers, which is a problem in the case of older Arm architectures and codegen options similar to -mgeneral-regs-only (for code that runs "below" userspace). Said FPU registers also take a lot of space in the stack frame
Really, stackless coros are just FSM generators (which is obvious if one looks at disasm)
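To illustrate the FSM point, this is roughly what a generator yielding 1, 2, 3 looks like when written by hand as a state machine. It is a sketch of the general shape, not actual compiler output, and `CounterFsm` is a made-up name.

```cpp
// Hand-written state machine equivalent to a generator that yields 1, 2, 3.
struct CounterFsm {
    enum class State { Start, Looping, Done };

    State state = State::Start;
    int i = 0;  // a "local" that has to live across suspension points

    // Returns true and writes the next value to `out`, or false when finished.
    bool resume(int& out) {
        switch (state) {
        case State::Start:
            i = 1;
            state = State::Looping;
            out = i;
            return true;
        case State::Looping:
            if (++i <= 3) {
                out = i;
                return true;
            }
            state = State::Done;
            return false;
        case State::Done:
            return false;
        }
        return false;
    }
};

// Usage: drive the state machine until it reports completion.
int sum_fsm() {
    CounterFsm fsm;
    int value = 0;
    int sum = 0;
    while (fsm.resume(value)) sum += value;
    return sum;
}
```

The struct plays the role of the coroutine frame: the saved state plus whatever locals are live across a suspension point.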
A stackful coroutine implementation has to save exactly the same registers that a stackless one has to: the live ones at the suspension point.
A pure library implementation that relies on normal function call semantics obviously needs to conservatively save at least all callee-save registers, but that's not the only possible implementation. An implementation with compiler help should be able to do significantly better.
Ideally the compiler would provide a built-in, but even, for example, an implementation using GCC inline ASM with proper clobbers can do significantly better.
Stackful makes for cute demos, but you need huge per-thread stacks if you actually end up calling into Linux libc, which tends to assume typical OS thread stack sizes (8MB). (I don't disagree that some of the other tradeoffs are nice, and I have no love for C++20 coroutines myself.)
As an ex-gamedev, suspend/resume/stackful coroutines made them too heavy to have several thousand of them running during a game loop for our game. At the time we used GameMonkey Script: https://github.com/publicrepo/gmscript
That was over 20 years ago. No idea what the current hotness is.
Several thousand? What were you using them for? Coroutines' main utility is that they let you write complex code that pauses and still looks sensible, so for games, you'd typically put stuff like the behavior of an NPC in a coroutine. If you have thousands of things to put each in its own coroutine, they must have been really, really simple stuff. At that point, the cost of context switching can become significant.
> the cost of context switching can become significant.
Which is why a solution with no (or very tiny) context switching is preferred over one that's heavy to switch.
> they must have been really, really simple stuff
Yes, because they were low-overhead it was trivial to start them for all kinds of tiny things.
Actually you don't even need ASM at all. You just need smart use of compiler built-ins to make it truly portable. See my composable continuation implementation: https://godbolt.org/z/zf8Kj33nY
A much nicer code base to study is: https://swtch.com/libtask/
The stack save/restore happens in: https://swtch.com/libtask/asm.S
Single OS thread only, FWIW (no M:N scheduling). And like any stackful implementation, requires relatively huge stack allocations if you actually call into stdlib, particularly things like getaddrinfo().