It's worth noting that this is not async/await in the sense of essentially every other language that uses those terms.

In other languages, when the compiler sees an async function, it compiles it into a state machine or 'coroutine', where the function can suspend itself at designated points marked with `await`, and be resumed later.
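
As a rough illustration (a hypothetical hand-written C sketch, not any particular compiler's output), the transform turns each await point into a state:

    /* Sketch of the state-machine transform: locals that must survive a
     * suspension are hoisted out of the stack and into the frame. */
    struct task {
        int state;   /* which await point to resume at */
        int fd;      /* a "local" that lives across the await */
    };

    /* Returns 1 when finished, 0 when suspended waiting on I/O. */
    int task_poll(struct task *t) {
        switch (t->state) {
        case 0:
            /* ...start a nonblocking read on t->fd... */
            t->state = 1;   /* first `await`: suspend */
            return 0;
        case 1:
            /* resumed: the read completed; run the code after the await */
            return 1;
        }
        return 1;
    }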

Zig's compiler used to support coroutines, but that support was removed. In the new design, `async` and `await` are just functions. In the threaded implementation used in the demo, `await` simply blocks the thread until the operation is done.
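
In C terms, the threaded model amounts to roughly this (a hypothetical sketch, not Zig's actual `Io` API):

    /* "async" spawns a thread; "await" just joins it. */
    #include <pthread.h>
    #include <stdio.h>

    struct future { pthread_t thread; };

    static void *do_io(void *arg) {
        (void)arg;
        puts("doing the I/O on another thread");
        return NULL;
    }

    static struct future async_do_io(void) {
        struct future f;
        pthread_create(&f.thread, NULL, do_io, NULL);
        return f;
    }

    static void await(struct future *f) {
        pthread_join(f->thread, NULL);   /* "await" = block until done */
    }

    int main(void) {
        struct future f = async_do_io();  /* work proceeds concurrently */
        await(&f);                        /* blocks the calling thread */
    }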

To be fair, the bottom of the post explains that there are two other Io implementations being planned.

One of them is "stackless coroutines", which would be similar to traditional async/await. However, from the discussion so far, this seems a bit like vaporware. As discussed in [1], andrewrk explicitly rejected the idea of just (re-)adding normal async/await keywords, and instead wants a different design, as tracked in issue 23446. But in issue 23446 there seems to be zero agreement on how the feature would work, how it would improve on traditional async/await, or how it would avoid function coloring.

The other implementation being planned is "stackful coroutines". From what I can tell, this has more of a plan and is more promising, but there are significant unknowns.

The basis of the design is similar to green threads or fibers. Low-level code generation would be identical to normal synchronous code, with no state machine transform. Instead, a library would implement suspension by swapping out the native register state and stack, just like the OS kernel does when switching between OS threads. By itself, this has been implemented many times before, in libraries for C and in the runtimes of languages like Go. But it has the key limitation that you don't know how much stack to allocate. If you allocate too much stack in advance, you end up being not much cheaper than OS threads; but if you allocate too little stack, you can easily hit stack overflow. Go addresses this by allocating chunks of stack on demand, but that still imposes a cost and a dependency on dynamic allocation.
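
The mechanism itself is simple enough to sketch with POSIX `ucontext` (illustrative only; real fiber libraries use hand-written assembly, but the mechanics are the same). Note the stack size we had to guess at:

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, fiber_ctx;

    static void fiber_fn(void) {
        puts("fiber: running");
        swapcontext(&fiber_ctx, &main_ctx);   /* suspend back to main */
        puts("fiber: resumed");
    }

    int main(void) {
        size_t stack_size = 64 * 1024;        /* the hard question: how big? */
        char *stack = malloc(stack_size);
        getcontext(&fiber_ctx);
        fiber_ctx.uc_stack.ss_sp = stack;
        fiber_ctx.uc_stack.ss_size = stack_size;
        fiber_ctx.uc_link = &main_ctx;        /* return here when fiber ends */
        makecontext(&fiber_ctx, fiber_fn, 0);

        swapcontext(&main_ctx, &fiber_ctx);   /* run until it suspends */
        puts("main: fiber is suspended");
        swapcontext(&main_ctx, &fiber_ctx);   /* resume it to completion */
        free(stack);
    }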

andrewrk proposes [2] to instead have the compiler calculate the maximum amount of native stack needed by a function and all its callees. In this case, the stack could be sized exactly to fit. In some sense this is similar to async in Rust, where the compiler calculates the size of async function objects based on the amount of state the function and its callees need to store during suspension. But the Zig approach would apply to all function calls rather than treating async as a separate case. As a result, the benefits would extend beyond memory usage in async code. The compiler would statically guarantee the absence of stack overflow, which benefits reliability in all code that uses the feature. This would be particularly useful in embedded systems, where reliability demands are typically high and available memory is low. Right now in embedded, people sometimes use a GCC feature (`-fstack-usage`) that does a similar calculation, but it's messy enough that people often don't bother. So it would be cool to have this as a first-class feature in Zig.
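
For reference, here's roughly what that GCC feature looks like in practice (the numbers below are illustrative, not exact):

    /* $ gcc -c -fstack-usage demo.c
     * $ cat demo.su
     * demo.c:5:12:leaf        16      static
     * demo.c:9:12:parent      48      static
     *
     * "static" means the frame size is a compile-time constant. Summing the
     * worst case along the call graph is left to you (or a script); that
     * whole-program bound is the part Zig proposes to make first-class. */
    static int leaf(int x) {
        return x * 2;
    }

    static int parent(int x) {
        int buf[8] = {0};
        buf[0] = leaf(x);
        return buf[0];
    }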

But.

There's a reason that stack usage calculators are uncommon. If you want to statically bound stack usage:

First, you have to ban recursion, or else add some kind of language mechanism for tracking how many times a function can possibly recurse. Banning recursion is common in embedded code but would be rather annoying for most codebases. Tracking recursion is definitely possible, as shown by proof languages like Agda or Coq that make you prove termination of recursive functions - but those languages have a lot of tools that 'normal' languages don't, so it's unclear how ergonomic such a feature could be in Zig. The issue [2] doesn't have much concrete discussion on how it would work.
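
A language mechanism for tracked recursion would essentially formalize the manual "fuel" idiom (hypothetical sketch in C):

    /* Depth is capped explicitly, so worst-case stack usage is bounded by
     * MAX_DEPTH frames and a stack analyzer could account for it. */
    #define MAX_DEPTH 64

    struct node { struct node *next; };

    static int list_len(const struct node *n, int fuel) {
        if (fuel == 0) return -1;                 /* budget exhausted: error */
        if (n == NULL) return 0;
        return 1 + list_len(n->next, fuel - 1);   /* bounded recursion */
    }
    /* callers pass the budget explicitly: list_len(head, MAX_DEPTH) */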

Second, you have to ban dynamic calls (i.e. calls to function pointers), because if you don't know what function you're calling, you don't know how much stack it will use. This has been the subject of more concrete design in [3] which proposes a "restricted" function pointer type that can only refer to a statically known set of functions. However, it remains to be seen how ergonomic and composable this will be.
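
The C analogue of a restricted function pointer is an enum-plus-switch dispatch over a closed set of callees (a sketch of the idea; [3] proposes making this an actual pointer type in Zig):

    /* The callee set is closed and statically known, so an analyzer can
     * take the max of the callees' stack needs. */
    enum callee { CB_DOUBLE, CB_INCR };

    static int cb_double(int x) { return x * 2; }
    static int cb_incr(int x)   { return x + 1; }

    static int dispatch(enum callee which, int x) {
        switch (which) {             /* every possible callee is visible here */
        case CB_DOUBLE: return cb_double(x);
        case CB_INCR:   return cb_incr(x);
        }
        return 0;
    }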

Zooming back out:

Personally, I'm glad that Zig is willing to experiment with these things rather than just copying the same async/await feature as every other language. There is real untapped potential out there. On the other hand, it seems a little early to claim victory, when all that works today is a thread-based I/O library that happens to have "async" and "await" in its function names.

Heck, it seems early to finalize an I/O library design if you don't even know how the fancy high-performance implementations will work. Though to be fair, many applications will get away just fine with threaded I/O, and it's nice to see a modern I/O library design that embraces that as a serious option.

[1] https://github.com/ziglang/zig/issues/6025#issuecomment-3072...

[2] https://github.com/ziglang/zig/issues/157

[3] https://github.com/ziglang/zig/issues/23367

> But it has the key limitation that you don't know how much stack to allocate. If you allocate too much stack in advance, you end up being not much cheaper than OS threads; but if you allocate too little stack, you can easily hit stack overflow.

With a 64-bit address space you can reserve large contiguous chunks (e.g. 2MB) while only allocating the minimum necessary for the optimistic case. The real problem isn't memory usage per se, it's all the VMA manipulation and noise. In particular, setting up guard pages traditionally requires a separate VMA region for each guard (usually two per stack, above and below). Linux recently got a new madvise feature, MADV_GUARD_INSTALL/MADV_GUARD_REMOVE, which lets you install cheap guard pages without creating a distinct VMA for each one. (https://lwn.net/Articles/1011366/) This is the type of feature that could be used to reduce the overhead of stackful coroutines/fibers.

In theory fibers should be able to outperform explicit async/await code: in the non-recursive, non-dynamic-call case a fiber's stack can be stack-allocated by the caller, making it no more costly than allocating a similar async/await call frame, while in the recursive and dynamic-call cases you avoid dynamic frame bouncing, which in the majority of situations is unnecessary. (The poor performance of dynamic frame allocation/deallocation in deep dynamic call chains is why Go switched from segmented stacks to movable stacks.)
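
Roughly like this (a sketch; the MADV_GUARD_INSTALL value below matches the Linux 6.13-era uapi headers, so treat it as an assumption and prefer your system's headers):

    #include <sys/mman.h>
    #include <stddef.h>

    #ifndef MADV_GUARD_INSTALL
    #define MADV_GUARD_INSTALL 102
    #endif

    #define STACK_RESERVE (2u << 20)   /* 2 MiB of address space */
    #define PAGE_SZ 4096u

    static void *alloc_fiber_stack(void) {
        /* Reserve address space; physical pages are committed on first touch. */
        char *base = mmap(NULL, STACK_RESERVE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) return NULL;
        /* Guard the low end without splitting the mapping into extra VMAs. */
        (void)madvise(base, PAGE_SZ, MADV_GUARD_INSTALL);
        return base + STACK_RESERVE;   /* stacks grow down: hand back the top */
    }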

Another major cost of fibers/threads is context switching: most existing solutions save and restore all registers. But for coroutines (stackless or stackful), there's no need to do this. See, e.g., https://photonlibos.github.io/blog/stackful-coroutine-made-f..., which tweaked Clang to erase this cost and bring it in line with normal function calls.
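
The point is easy to see in a minimal switch (a sketch for x86-64 System V; first-time setup of a brand-new stack is omitted). Because the switch looks like an ordinary function call, the ABI already treats everything except the callee-saved registers as clobbered:

    void stack_swap(void **save_sp, void *load_sp);
    __asm__(
        ".globl stack_swap\n"
        "stack_swap:\n"
        "    pushq %rbp\n"
        "    pushq %rbx\n"
        "    pushq %r12\n"
        "    pushq %r13\n"
        "    pushq %r14\n"
        "    pushq %r15\n"
        "    movq %rsp, (%rdi)\n"   /* save current stack pointer */
        "    movq %rsi, %rsp\n"     /* adopt the other stack */
        "    popq %r15\n"
        "    popq %r14\n"
        "    popq %r13\n"
        "    popq %r12\n"
        "    popq %rbx\n"
        "    popq %rbp\n"
        "    ret\n");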

> Go addresses this by allocating chunks of stack on demand, but that still imposes a cost and a dependency on dynamic allocation.

The dynamic allocation problem exists just the same whether you're using stackless coroutines, stackful coroutines, etc. Fundamentally, async/await in Rust is just creating a linked list of call frames, like some mainframes do/did. How many Rust users manually check for OOM when creating a Boxed dyn coroutine? Handling dynamic stack growth is technically a problem even in C; it's just that without exceptions and thread-scoped signal handlers there's no easy way to handle overflow, so few people bother. (Heck, few even bother on Windows, where it's much easier with SEH.) But these are fixable problems; they just require coordination up and down the OS stack and across toolchains. The inability to coordinate these solutions does not turn ugly compromises (async/await) into cool features.
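
That "linked list of call frames" can be taken literally (a sketch): each suspended operation is a heap frame recording where to resume and who is awaiting it.

    struct aframe {
        struct aframe *awaiter;   /* resume this frame when we complete */
        int            state;     /* which suspension point to resume at */
        /* locals that must survive suspension live here, not on a stack */
    };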

> First, you have to ban recursion, or else add some kind of language mechanism for tracking how many times a function can possibly recurse. [snip]

> Second, you have to ban dynamic calls (i.e. calls to function pointers)

Both of which are already the case for async/await in Rust: you have to explicitly Box any async call that Rust can't statically size. We might frame this as being transparent and consistent, except it's not actually consistent, because we don't treat "ordinary", non-async calls this way; they still use the traditional contiguous stack that kills the program on overflow. Nobody wants that much consistency (too much of a "good" thing?), because treating each and every call as async, with all the explicit management that would entail under the current semantics, would be an indefensible nightmare for the vast majority of use cases.

> If you allocate too much stack in advance, you end up being not much cheaper than OS threads;

Maybe. A smart event loop could track how many frames are in flight at any given time and reuse preallocated frames as operations complete and hand theirs back.
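
Something like this hypothetical sketch: a fixed pool recycled through a free list, so the steady state allocates nothing and a NULL return doubles as backpressure.

    #define MAX_IN_FLIGHT 256

    struct frame { struct frame *next_free; char stack[16 * 1024]; };

    static struct frame pool[MAX_IN_FLIGHT];
    static struct frame *free_list;

    static void pool_init(void) {
        for (int i = 0; i < MAX_IN_FLIGHT; i++) {
            pool[i].next_free = free_list;
            free_list = &pool[i];
        }
    }

    static struct frame *frame_acquire(void) {
        struct frame *f = free_list;
        if (f) free_list = f->next_free;
        return f;   /* NULL: too many in flight, caller must wait */
    }

    static void frame_release(struct frame *f) {
        f->next_free = free_list;
        free_list = f;
    }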

Regarding recursive functions, would it really be that annoying? We kinda take the ability to recurse for granted, yet it's rarely used in practice, and when it does happen it's often unintentional and a source of bugs (due to unforeseen re-entrancy). Intuitively it feels like, if `recursive` were a required modifier when declaring intentionally recursive functions, like in Fortran, it wouldn't actually be used all that much. Most functions don't need to call via function pointers, either.

Being explicit about it might also allow for some interesting compiler optimizations across shared library boundaries...

Fortran is recursive by default as of Fortran 2018; before that, you had to mark procedures `recursive`.

> tracking how many times a function can possibly recurse.

> Tracking recursion is definitely possible, as shown by proof languages like Agda or Coq that make you prove termination of recursive functions

Proof languages don't really track how many times a function can possibly recurse; they only care that it will eventually terminate. The number of recursive steps can easily depend on the inputs, making it unknown at the point where the function is defined.

Right. They use rules like "a function body must destructure its input and cannot use a constructor", which implies that, since the input can't be infinite, the function will terminate. That doesn't mean the recursion depth is known before running.

The actual rule is more refined than that; it doesn't prevent you from using constructors in the body or recursing without deconstructing any input, it just needs to prove that the new inputs are "smaller" according to a fixed well-founded relation. This is most often the structural ordering on the input, but it doesn't have to be. For example, I can have f(5) recurse into f(6) if that's "smaller" according to my custom relation, but no well-founded relation will let me keep increasing the argument forever.
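
A minimal sketch of that in Lean 4 (the measure `10 - n` plays the role of the custom "smaller"; recent Lean versions discharge the proof obligation with `omega`):

    -- The argument increases on each call, yet the function terminates,
    -- because the measure `10 - n` strictly decreases.
    def countUp (n : Nat) : Nat :=
      if h : n < 10 then
        countUp (n + 1)   -- argument grows; 10 - (n+1) < 10 - n given h
      else
        n
    termination_by 10 - n
    decreasing_by omega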

This blog post had a better explanation than the one linked from the story IMO:

https://kristoff.it/blog/zig-new-async-io/