Async seems like an underbaked idea across the board. Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away. But we didn’t like structuring code into logical threads, so we added callback systems for events. Then we realized callbacks are very hard to reason about and that sequential control flow is better.

So threads were the right programming model.

Now language runtimes prefer “green threads” for portability and performance, but most languages don’t provide them properly. Instead we have awkward coloring of async/non-async functions and all these problems around scheduling, priority, and lack of preemption. It’s a worse scheduling and process model than we had in 1970.

> Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away

Not really. I’ve observed that async code is often written in a way that doesn’t maximize how much concurrency can be expressed (e.g. instead of writing “here are N I/O operations, do them all concurrently” it’s “for operation x, await process(x)”). However, in a threaded world this concurrency problem gets worse, because you have no way to optimize towards such concurrency - threads are inherently and inescapably too heavyweight to express concurrency in an efficient way.
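
For illustration, the two shapes look roughly like this (a sketch in Rust, assuming the `futures` crate; `process` is a stand-in for any async I/O operation, not from the original comment):

```rust
use futures::future::join_all; // assumes the `futures` crate

// Stand-in for some async I/O operation.
async fn process(x: u32) -> u32 {
    x * 2
}

// "for operation x, await process(x)": one at a time, no concurrency expressed.
async fn sequential(items: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::new();
    for x in items {
        out.push(process(x).await);
    }
    out
}

// "here are N I/O operations, do them all concurrently".
async fn concurrent(items: Vec<u32>) -> Vec<u32> {
    join_all(items.into_iter().map(process)).await
}
```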

This is not a new lesson - work-stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s this is why Apple developed GCD. Threads simply don’t give the kernel scheduler any richer information about the workload, and kernel threads are an insanely heavy mechanism for achieving fine-grained concurrency - even worse when that concurrency is I/O or a mixed workload instead of pure compute that’s embarrassingly easy to parallelize.

Do all programs need this level of performance? No, probably not. But it is significantly easier to reach a higher performance bar - and, in practice, a latency and throughput level that traditional approaches can’t match with the same level of effort.

You can tell async is directionally kind of correct in that io_uring is the kernel’s approach to high-performance I/O, and it looks nothing like traditional threading and syscalls - its completion model looks a lot closer to async concurrency (although, granted, exploiting it fully is much harder in an async world, because async/await is an insufficient number of colors to express how async tasks interrelate).

> work-stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s

Well, we’ve known how to make "traditional threads" fast, with lower latency and more consistent P99, since forever^2 - the early 90s. [1]

Sure, we can't convince that Finnish guy this is worthwhile to include in THE kernel, even though similar ideas have been running in Google datacenters for idk how many years - 15+? But nothing stops us from doing it in userspace, just as you said: a work-stealing executor. And no, no coloring.

Stack is all you need. Just make your "coroutines" stackful. Done. All those attempts to be "zero-cost" and change the programming model dramatically to avoid a stack introduced much more overhead than a stack plus a piece of decent context-switch code.

> You can tell async is directionally kind of correct in that io_uring is the kernel’s approach

lol, it is very hard to model any proactor like io_uring with async Rust, due to its defects.

[1] https://dl.acm.org/doi/10.1145/121132.121151

Stackful coroutines give up a fair amount of efficiency in a number of places to make that workable. That’s fine if you want to use a lot more RAM - Go and Java make that tradeoff - but it’s not suitable for something like Rust. That’s why Rust’s and C++’s async implementations are rather similar in many ways. Stackful coroutines also play havoc with FFI, carrying a huge cost penalty across the board even for code that doesn’t care about coroutines. These aren’t theoretical tradeoffs to just hand-wave away as “doesn’t matter” - it literally does matter for how Rust is positioned. No one is stopping you from using Go or the JVM if that’s the ecosystem you like better.

> lol, it is very hard to model any proactor like io_uring with async Rust, due to its defects.

Not really. People latched on to async cancellation issues as intractable because of one paper, but I’m not convinced it’s unsolvable - whether via runtimes that consider the issue more fundamentally, or via the language adding async drop, which would let the existing runtimes solve the problem wholesale.

The point I’m making is that I/O and hardware is fundamentally non-blocking and we will always pay a huge abstraction penalty to try to pretend we have a synchronous programming model.

There are always trade-offs and there is never one best way to do something.

Stack-based coroutines are one way to do it. A relevant trade-off here is overhead: requiring a runtime narrows the potential use cases this can serve (e.g. embedded real-time stuff).

If you don’t care about supporting such use cases you can of course just create a copy of goroutines and be pretty happy with the result.

> even though similar ideas have been running in Google datacenters for idk how many years

I guess this is referring to https://www.youtube.com/watch?v=KXuZi9aeGTw ?

I am not saying threads are the model for all programming problems. For example, a dependency graph like an Excel spreadsheet can be analyzed and parallelized.

But as you observed, async/await fails to express concurrency any better. It’s also a thread - just a worse implementation of one.

That’s incorrect. Even when expressed suboptimally, it still tends to result in overall higher throughput and consistently lower latency (work stealing executors specifically). And when you’re in this world, you can always do an optimization pass to better express the concurrency. If you’ve not written it async to start with, then you’re boned and have no easy escape hatch to optimize with.

Why can’t you do the same optimization? Are you maxing out your OS’s resources on thread overhead?

That’s part of it. Then you add a thread pool to dispatch your tasks to, to mitigate the cost of a thread start. Then you run into blocking problems and think “I wish I had some keyword to express when a function needs to be run on the thread pool”. Then you’ve done a speed run of the past 40 years of research.
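
In tokio terms (my assumption - the comment names no runtime, and `load_file_sync` is a hypothetical blocking function), that keyword already exists:

```rust
// Sketch assuming the tokio runtime; `load_file_sync` is hypothetical.
async fn handler() -> std::io::Result<Vec<u8>> {
    // Run the blocking work on tokio's dedicated blocking-thread pool
    // instead of tying up an async worker thread.
    tokio::task::spawn_blocking(|| load_file_sync("/tmp/data"))
        .await
        .expect("blocking task panicked")
}

fn load_file_sync(path: &str) -> std::io::Result<Vec<u8>> {
    std::fs::read(path)
}
```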

The 40 years of research was actually in OS theory so that you could write normal code and async was abstracted away.

A thread pool is not a research project.

Although they can be used in similar ways, they work very differently.

* Cooperative vs. preemptive scheduling

* Userspace vs. kernel scheduling

* Stackless vs. stackful

* Easy control over waiting/blocking behavior vs. none

* Easy fan out + join vs. maybe, with some work and thread spawn overhead

* Can integrate within a single-threaded event loop vs. not really

Depending on what you're doing they may be interchangeable, or you can only go one way. The basic case where you're doing essentially synchronous work in a thread/task is no different either way, other than function coloring with async/await and efficiency. If you're doing UI work, event handlers are likely running in a single-threaded event loop - the only thread you can interact with the UI on - which you can't block or the UI is going to freeze.

OS thread overhead can be pretty substantial. Starting new threads on Windows is especially expensive.

> threads are inherently and inescapably too heavyweight to express concurrency in an efficient way

Your premise is wrong. There are many counterexamples to this.

Can you explain more? I've always heard this.

The most prominent example is probably Go with its goroutines, but there are many more. You can easily spawn tens of thousands of goroutines with low overhead and great performance.

Goroutines/"fibers"/"green threads" are usually scheduled by the runtime system across a small pool of actual OS threads.

The word "thread" is confusing things. In computer science a thread represents a flow of execution, which in concrete terms where execution is a series of function calls, is typically a program counter and a stack.

There are many ways to implement and manage threads. In Unix-like and Windows systems a "thread" is the above, plus a bunch of kernel context, plus implicit preemptive context switching. Because Unix and Windows added threads to their architectures relatively late in their development, each thread has to behave sort of like its own process, capable of running all the pre-existing software that was thread-agnostic. Which is why they have implicit scheduling, large userspace stacks, etc.

But nothing about "thread" requires it to be implemented or behave exactly like "OS threads" do in popular operating systems. People wax on about Async Rust and state machines. Well, a thread is already state machine, too. Async Rust has to nest a bunch of state machine contexts along with space for data manipulated in each function--that's called a stack. So Async Rust is one layer of threading built atop another layer of threading. And it did this not because it's better, but primarily because of legacy FFI concerns and interoperability with non-Rust software that depended on the pre-existing ABIs for stack and scheduling management.

Go largely went in the opposite direction, embracing threads as a first-class concept in a way that makes it no less scalable or cheap than Rust Futures, notwithstanding that Go, too, had to deal with legacy OS APIs and semantics, which they abstracted and modeled with their G (goroutine), M (machine), P (processor) architecture.

I thought it was obvious from context: OS threads are too heavyweight for fine-grained concurrency.

Go uses userspace threads. It’s also interesting that Go and Java are the only mainstream languages to have gone this route. The reason is that green threads carry a huge penalty when calling into FFI code that doesn’t use them, whereas this cost isn’t there for async/await.

Also that you have to rewrite the entire standard library, because the kernel knows how to suspend kernel threads on syscalls, but not green threads. (Go and Java already had to do this anyway, of course.)

> the thread sleeps until ready and the kernel abstracts it away.

Sure, but once you involve the kernel and OS scheduler things get 3 to 4 orders of magnitude slower than what they should be.

The last time I was working on our coroutine/scheduling code, creating and joining a thread that exited instantly was ~200us, while creating one of our green threads, scheduling it, and waiting for it was ~400ns.
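
The OS-thread half of that measurement is easy to reproduce; a minimal sketch in Rust (absolute numbers will vary by machine and OS):

```rust
use std::time::Instant;

fn main() {
    // Warm up so first-spawn costs don't skew the average.
    std::thread::spawn(|| {}).join().unwrap();

    let n: u32 = 1_000;
    let start = Instant::now();
    for _ in 0..n {
        // Create and join a thread that exits immediately.
        std::thread::spawn(|| {}).join().unwrap();
    }
    println!("avg spawn+join: {:?}", start.elapsed() / n);
}
```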

You don't need to wait 10 years for someone else to design yet another absurdly complex async framework; you can roll your own green threads/stackful coroutines in any systems language with 20 lines of ASM.

1. Why can’t we have better green threads implementations with better scheduling models?

2. Unchecked array operations are a lot faster. Manual memory management is a lot faster. Shared memory is a lot faster.

Usually when you see someone reach for sharp and less expressive tools it’s justified by a hot code path. But here we jump immediately to the perf hack?

3. How many simultaneous async operations does your program have?

Well, if you offload heavy compute into an async task, then usually it depends strictly on how many concurrent inputs you are given. But even something as “simple” as a performant editor benefits from this if done well - that’s why JS text editors have reasonably acceptable performance whereas Java IDEs always struggled (historically anyway, since even Java has now adopted green threads).

Are you sure Java's UI issues are caused by threading and not just Swing being a glitchy pile of junk?

For example, if you don't explicitly call the java.awt.Toolkit.sync() method after updating the UI state (which according to the docs "is useful for animation"), Swing will in my experience introduce seemingly random delays and UI lag because it just doesn't bother sending the UI updates to the window system.

Only NetBeans is written in Swing. Eclipse and JetBrains use their own thing and still generally struggled.

No, JetBrains uses Swing in IntelliJ IDEA. You can tell from how it (for example) fails to lay out dialogs correctly the first time they're displayed, just like every other Swing application. And how windows have no minimum size, because Swing doesn't expose that functionality. And the various baffling bugs involving window focus that are inherent to Swing applications.

Eclipse uses SWT instead, which wraps the platform's native widgets.

When did you last use IntelliJ, 30 years ago? I've never seen it fail to lay out dialogs correctly, windows do have minimum sizes, and I haven't seen any focus bugs.

You think IDEs are written in JS because of the performance benefits of the threading model?

I thought it was because they could copy chromium.

Why do you think they don’t struggle with input latency? Because the non-blocking nature built into the browser model is so powerful, and you cannot get that with threads.

I disagree with the premise. I cannot imagine a better latency experience than blocking loop IDEs like VS6.

Which inputs are getting latency? The keyboard? The files?

> the non-blocking nature

https://youtu.be/bzkRVzciAZg?si=BuBXxHTgN0OqsAhI

Hate to break it to you, but Windows GUI programming, epitomized by VS6, is about as far away from a blocking threaded model as you can get. You literally have a UI event loop, and any compute-intensive work is meant to be offloaded to other threads via messages/COM. This is why, when they failed to do that correctly, the entire UI would lock up - because they didn’t have good hygiene around how to offload compute-intensive operations that also interacted with the GUI.

You’ve literally argued against yourself without realizing.

Wait which programming model are you arguing is the low latency one? I thought you said it was JS because non-blocking.

Event loops are also non-blocking. That’s literally why JS is non-blocking. But event loops and callbacks are extremely hard to scale, maintain, and keep non-blocking. That’s why async/await is a more powerful abstraction - you don’t pretend I/O is this blocking thing, you interleave other work while it’s being done, and you don’t get impossible-to-follow callback hell. VS6 suffered from non-responsive hangs all the time because some developer forgot to offload something that turned out to be compute-heavy under certain conditions.

Also, the parts they couldn’t make non-blocking (e.g. file reads) were precisely where VS6 would shit the bed and hang the entire UI trying to open a large file.

Are you sure the latency-sensitive parts are written in async JS instead of having a separate UI thread (pool)? I have no idea myself, but without knowing the details it's hard to argue. Note that browsers themselves are usually written in languages like C++ or Rust. They run JS, but aren't written in it.

If you implement threads and code that reacts to an input queue (e.g. PostMessage, queue_push, mq_send, ...), you've implemented (probably a bad version of) async threads. And yes, that's exactly what Windows 1.0 did and what made it great.

But God help you if you have to change the code. Async threads are a way to organize it and make it workable for humans.

Yes they are, the UI layer is mostly JS, outside the rendering and layout engines.

Maybe you're remembering the performance of IDEs from 15 years ago, because that definitely isn't my experience.

> that’s why JS text editors have reasonably acceptable performance

Absolutely not

You also involve the kernel when you are doing async I/O.

In this context the interesting thing to measure would be doing I/O in your green threads vs. OS threads.

A stronger theoretical performance argument for async I/O is that you can do batching, a la io_uring, and do fewer protection-domain crossings per I/O that way.

Well yeah, of course - using APIs like io_uring and Grand Central Dispatch is basically the whole point of all this async stuff in a systems programming language. It’s absurd it hasn’t been mentioned more here.

OS threads are for compute parallelism; async with stackless coroutines (ideally) or green threads is for I/O parallelism. It’s pretty straightforward.

And IMO, Zig has shown how to do async I/O right (the foundational stuff; other languages could add better syntax for ergonomics).

It's not the whole point, there's lots of other (albeit smaller) gains to be had once you have a strong async apparatus.

The core of your async implementation doesn't have to care about I/O - as long as it has a way to block/schedule fibers, it's easy to implement io_uring/IOCP based I/O on top of that - it's a matter of sticking a single IO poll in your main loop, and when you get a result, schedule the fiber that's waiting for it.

Another thing you get almost for free is an accurate Sleep(0.3) - your Sleep pushes the current fiber into a global vector along with the time it should be resumed, and you loop over that vector in your main loop.

We're writing a game engine so WaitForNextFrame() is another useful one - the implementation is literally pushing the current fiber to a vector and resuming it the next tick.
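
A sketch of how such a Sleep can fall out of the main loop, with hypothetical `Fiber`/`resume` names standing in for whatever the engine's actual fiber handle is:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Instant;

// Hypothetical fiber handle; resume() would switch to the fiber's stack.
struct Fiber;
impl Fiber {
    fn resume(&self) { /* context switch into the fiber */ }
}

struct Scheduler {
    fibers: Vec<Fiber>,
    // Min-heap of (wake time, fiber index): earliest deadline pops first.
    sleepers: BinaryHeap<Reverse<(Instant, usize)>>,
}

impl Scheduler {
    // Called by a sleeping fiber before it switches away.
    fn sleep_until(&mut self, fiber: usize, when: Instant) {
        self.sleepers.push(Reverse((when, fiber)));
    }

    // Called once per tick of the main loop.
    fn wake_due(&mut self) {
        let now = Instant::now();
        while let Some(&Reverse((when, fiber))) = self.sleepers.peek() {
            if when > now {
                break;
            }
            self.sleepers.pop();
            self.fibers[fiber].resume();
        }
    }
}
```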

> So threads were the right programming model.

It depends on what you are doing. Threads are the right model for compute-bound workloads. Async is the right model for bandwidth-bound workloads.

Optimization of bandwidth-bound code is an exercise in schedule design. In a classic multithreading model you have limited control over scheduling. In an async model you can have almost perfect control over scheduling. A well-optimized async schedule is much faster than the equivalent multithreaded architecture for the same bandwidth-bound workload. It isn't even close.

Most high-performance code today is bandwidth-bound. Async exists to make optimization of these workloads easier.

If this is a classic exercise, can you show me the material?

Why can’t a scheduler be written which optimizes around IO? What additional information is present in code that has async/await annotations?

Threads are a scheduling model that delegates to the OS scheduler. Async style provides a primitive for creating a custom scheduler but is not a scheduler per se.

To use a custom scheduler you must first disable the existing schedulers your code is using by default for both execution and I/O. That means no OS scheduling. Thread-per-core architectures with static allocation and direct userspace I/O is the idiomatic way to do this regardless of programming language.

Optimal scheduling is a profoundly intractable problem -- it is AI-Complete. A generic scheduler is always going to be deeply suboptimal because a remotely decent schedule isn't practically computable in real systems. A more optimal scheduler must continuously rewrite the selection and ordering of thousands of concurrent operations in real-time. Importantly, this dynamic schedule rewriting is based on a model that can see across all operations globally and accurately predict both future operations that haven't happened yet and any ordering dependencies between current and future operations. A modern system can handle tens of millions of these operations per second, so the scheduling needs to be efficient.

A generic scheduler has to allow for almost arbitrary operation graphs and behavior. However, if you are writing e.g. a database engine, you have almost the entire context of how operations relate to each other both concurrently and across time. The design of a somewhat optimal scheduler that only understands your code becomes computationally feasible. It isn't trivial -- scheduler design is properly difficult -- but you build it using async style.

That’s not what I asked.

I'm going to hop in and say this would be a good exercise for you, instead. The industry has, in general, decided upon stackless threads and other async systems.

What does "I/O optimized scheduling" look like to you, and does it end up with the same sort of compiler hints, like "async / await"? Or is it different?

I believe that's actually how virtual threads in newer Java work. The runtime is smart enough to notice I/O, properly park the virtual thread, and move on to another.

I think it's still basically doing epoll behind the scenes [1], but you have straightforward sequential code in the process and the actual implementation is invisible to the user, and you can use old boring blocking code with an object that is a drop-in replacement for Thread.

I personally still kind of prefer the explicit async stuff with Futures and Vert.x since I kind of like the idea that async is encoded into the type itself so you're more directly aware of it, but I'm definitely an outlier for that.

[1] Genuinely, please correct me if I'm wrong, it's very possible that I am.

> but I'm definitely an outlier for that

You are not. I prefer the same, and that's how my product works right now. My HTTP API is Vert.x-only with futures. My particular use case is thousands of devices sending small packets to the API at undefined intervals or in bursts, so I find Vert.x event-loop performance quite a good match. Customer feedback thus far has been very positive.

Background tasks in my app are processed in a different module, which uses a plain old ScheduledExecutorService-based thread pool to poll. The tasks are visible in the UI as well. I still haven't switched to VTs, because I don't know what load implications that may have on the database pool. The JEP says `Do not pool virtual threads` [0]. I assume that if a db connection is not available in the pool, the VT will get parked, but that isn't quite what a background scheduler should look like, e.g. hundreds of "in-process" tasks blocked while waiting for a db connection to free up. Testing has been on my todo list for some time now.

The JEP doesn't mention epoll, but there is a write up about that on github: `On Linux the poller uses epoll, and on Windows wepoll (which provides an epoll-like API on the Ancillary Function Driver for Winsock)` [1]

0 - https://openjdk.org/jeps/444#Do-not-pool-virtual-threads

1 - https://gist.github.com/ChrisHegarty/0689ae92a01b4311bc8939f...

Glad I'm not alone! I find having the actual asynchrony itself as an object I can play with allows for some nice fine-grained concurrency and lets me be very explicit about when blocking happens.

It makes sense that they would use epoll under the covers; I would have been surprised if they weren't using epoll or io_uring/kqueue.

I think that callbacks are actually easier to reason about:

When it comes time to test your concurrent processing, to ensure you handle race conditions properly, that is much easier with callbacks because you can control their scheduling. Since each callback represents a discrete unit, you see which events can be reordered. This enables you to more easily consider all the different orderings.

Instead with threads it is easy to just ignore the orderings and not think about this complexity happening in a different thread and when it can influence the current thread. It isn't simpler, it is simplistic. Moreover, you cannot really change the scheduling and test the concurrent scenarios without introducing artificial barriers to stall the threads or stubbing the I/O so you can pass in a mock that you will then instrument with a callback to control the ordering...
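
A toy sketch of that controllability (not any particular framework's API): if the "scheduler" is just a queue of callbacks, the test decides the order.

```rust
use std::collections::VecDeque;

type Callback = Box<dyn FnOnce(&mut Vec<&'static str>)>;

fn main() {
    // Two "events" whose completion order we want to exercise.
    let read_done: Callback = Box::new(|log| log.push("read finished"));
    let timer_fired: Callback = Box::new(|log| log.push("timeout fired"));

    // The test controls scheduling simply by choosing the queue order.
    let mut queue: VecDeque<Callback> = VecDeque::new();
    queue.push_back(timer_fired); // try the "timeout wins" interleaving
    queue.push_back(read_done);

    let mut log = Vec::new();
    while let Some(cb) = queue.pop_front() {
        cb(&mut log);
    }
    assert_eq!(log, ["timeout fired", "read finished"]);
}
```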

The problem with callbacks is that the call stack, when captured, isn't the logical call stack, unless you're using one of the few libraries/runtimes that put in the work to make the call stacks make sense. Otherwise you need good error definitions.

You can of course mix the paradigms and have the worst of both worlds.

I agree. I don’t think callbacks are an underbaked language feature.

In another part of the thread, you lament the use of callbacks. May I ask you what you think async/await is, except syntactic sugar that wraps around the callback pattern?

Certain architectures focused around callbacks have problems. But their existence is not a burden on the language design.

Node.js has a problem where every standard library function has a callback and blocking version. At least they just committed to doing both.

Threads are neither better nor worse than async + callbacks. They are different. There are problems which map nicely to threads, and there are problems which are much nicer to express with async.

Such as? The entire premise of async is that callbacks were a mistake because they broke sequential reasoning and control.

Every explanation of the feature starts with managing callback hell.

Beware, they are different concepts.

Threads offer concurrent execution, async (futures) offer concurrent waiting. Loosely speaking, threads make sense for CPU bound problems, while async makes sense for IO bound problems.

Why? You write the same code with async/await, just with a keyword at the beginning of every function.

Because if you go down the call stack, eventually you won't get the await keyword anymore; you'll get the actual 'waiters' and 'wakers', which define your scheduling.
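
A minimal example of such a leaf: a hand-rolled Future that stashes the waker, which is the actual scheduling hook (this is the general shape, not any specific runtime's code):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

// The leaf future at the bottom of an await chain.
struct Signal(Arc<Mutex<Shared>>);

impl Future for Signal {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut s = self.0.lock().unwrap();
        if s.done {
            Poll::Ready(())
        } else {
            // Stash the waker so whoever completes the operation can
            // tell the executor to poll this task again.
            s.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

// The completion side (an I/O reactor, another thread, ...) wakes the task.
fn complete(shared: &Arc<Mutex<Shared>>) {
    let mut s = shared.lock().unwrap();
    s.done = true;
    if let Some(w) = s.waker.take() {
        w.wake();
    }
}
```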

Yeah. The OS handles scheduling and preemption so it’s done for you rather than a call in the stack.

The entire premise of callbacks is that threads were a mistake because they broke sequential reasoning and control.

JK, obviously callbacks became prominent as a result of folks looking for creative solutions to the C10K[0] problem, but threads have a long history of haters[1][2][3].

[0] https://en.wikipedia.org/wiki/C10k_problem

[1] https://brendaneich.com/2007/02/threads-suck/

[2] https://web.stanford.edu/~ouster/cgi-bin/papers/threads.pdf

[3] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-...

Async/await implementations usually also come with a runtime to handle the work scheduling as well as manage thread context. You can say that you can do that with just threads and callbacks but that's also essentially implementing async/await.

The callbacks should just be hidden from the programmer; that's what async/await is for.

The problem comes from trying to sit on two chairs at once: we want async but want to be able to opt out. This is what causes most of the ugliness, including function colouring. Just look at golang, where everything is async with no way to change it - it's great. It's probably not well-suited for things like microcontrollers, where every byte matters, but if you can afford the overhead, it's so much better than Rust async. Before async, Rust was an interesting and reasonable language; now it's just a hot mess that makes your eyes bleed for no reason.

> It's probably not well-suited for things like microcontrollers, where every byte matters, but if you can afford the overhead, it's so much better than Rust async.

There is one hill I'll die on, as far as programming languages go, which is that more people should study Céu's structured synchronous concurrency model. It specifically was designed to run on microcontrollers: it compiles down to a finite state machine with very little memory overhead (a few bytes per event).

It has some limitations in terms of how its "scheduler" scales when there are many trails activated by the same event, but breaking things up into multiple asynchronous modules would likely alleviate that problem.

I'm certain a language that supported the "Globally Asynchronous, Locally Synchronous" (GALS) paradigm could have its cake and eat it too - meaning something that combines support for a green-threading model of choice for async events with structured local reactivity a la Céu.

Francisco Sant'Anna, the creator of Céu, has actually been chipping away at a new programming language called Atmos that does support the GALS paradigm. However, it's a research language that compiles to Lua 5.4, so it won't really compete with the low-level programming languages there.

[0] https://ceu-lang.org/

[1] https://github.com/atmos-lang/atmos

Everything is not async in Go.

If your threads are "free" you can just run 400 copies of synchronous code, and blocking in one just frees the thread to work on others. Async within the same goroutine is still very much opt-in (you have to manually create a goroutine that writes to a channel that you then receive on); it just isn't needed when "spawn a thread for each connection" costs you barely a few KB per connection.
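
For illustration, the same pattern with plain OS threads - a toy thread-per-connection echo server in Rust, nothing production-grade:

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // One thread per connection; a blocking read just parks this
        // thread while the others keep serving their own connections.
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            while let Ok(n) = stream.read(&mut buf) {
                if n == 0 {
                    break;
                }
                if stream.write_all(&buf[..n]).is_err() {
                    break;
                }
            }
        });
    }
    Ok(())
}
```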

What GP meant - what everyone means when they say this - is that goroutines are always M:N threading and so there is no such thing as function coloring. In Rust to get M:N threading you have to use async and in practice every library you use has to use async. Hence function coloring, and two separate ecosystems of libraries in the same language.

> not well-suited for things like microcontrollers, where every byte matters

Except when a RAM fetch is so expensive that a load is basically an async call - and a single machine-code instruction at the same time.

> So threads were the right programming model.

For problems that aren't overly concerned with performance/memory, yes. You should probably reach for threads as a default, unless you know a priori that your problem is not in this common bucket.

Unfortunately there is quite a lot of bookkeeping overhead in the kernel for threads, and context switches are fairly expensive, so in a number of high-performance scenarios we may not be able to afford kernel threading.

In that sentence I’m referring to the abstract idea of a thread of execution as a model of programming, not OS threads. A green thread implementation could do it too.

But what you said about the kernel implementation is true. But are we really saying that the primary motivation for async/await is performance? How many programmers would give that answer? How many programs actually hit that bottleneck?

Doesn’t that buck the trend of every other language development in the past 20 years, emphasizing correctness and expressiveness over raw performance?

> But are we really saying that the primary motivation for async/await is performance?

Of course - what else would it be? The whole async trend started because moving away from each http request spawning (or being bound to) an OS thread gave quite extreme improvements in requests/second metrics, didn't it?

I agree. Managing many http requests or responses was a motivating problem.

What I question is 1. whether most programs resemble that, such that it merits being an invasive feature of every general-purpose language, and 2. whether programmers are making a conscious choice because they've ruled out the perf overhead of the simpler model we get by default.

That is why we have the function colouring problem and a split ecosystem in the first place - if it were obviously better in all cases, we'd make async the default, and get rid of the split altogether (and there are languages, like Erlang, that fall on this side of the fence).

It was not for performance reasons, but for scaling up.

That's the same thing?

> But are we really saying that the primary motivation for async/await is performance?

The original motivation for not using OS threads was indeed performance. Async/await is mostly syntax sugar to fix some of the ergonomic problems of writing continuation-based code (Rust more or less skipped the intermediate "callback hell" with futures that Javascript/Python et al suffered through).

In some languages, yes; in others (JS/Python) async is just a workaround for not having proper threading.

Python used multiple threads to handle I/O long before async/await was a glimmer in anyone's mind (despite the GIL). Node.js is one of the very few languages I can think of that was born single-threaded and used an asynchronous runtime from the get-go.

Importantly though, performance might be worse depending on the use case and program. Specifically, with scheduling in user space it can negatively impact branch prediction, as your CPU is already hyper-optimized for doing things differently.

It's all nuanced and what to choose requires careful evaluation.

As I understand, "green threads" are also expensive, for example you either need to allocate a large stack for each "thread", or hook stack allocation to grow the stack dynamically (like Go does), and if you grow the stack, you might have to move it and cannot have pointers to stack objects.

Green threads are fine for large servers with memory overcommit. Even with static stack sizes, you get benefits over OS threads due to the simpler scheduling. But the post was about embedded and green threads really suck there. Only using as much stack as you need for the task is the perfect solution for embedded systems.

>and if you grow the stack, you might have to move it

Most stacks are tiny and have bounded growth. Really large stacks usually happen with deep recursion, but it's not a very common pattern in non-functional languages (and functional languages have tail call optimization). OS threads allocate megabytes upfront to accommodate the worst case, which is not that common. And a tiny stack is very fast to copy. The larger the stack becomes, the less likely it is to grow further.

>cannot have pointers to stack objects

In Go, pointers that escape from a function force heap allocation, because it's unsafe, in principle, to refer to the contents of a destroyed stack frame later on. And if we only have pointers that never escape, it's relatively trivial to relocate such pointers during stack copying: just detect that a pointer is within the address range of the stack being relocated and recalculate it based on the new stack's base address.
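
That relocation is just address arithmetic; a sketch of the idea (hypothetical, not Go's actual implementation):

```rust
// Rebase an address that pointed into the old stack onto the new one;
// leave anything outside the old stack's range untouched.
fn relocate(addr: usize, old_base: usize, old_len: usize, new_base: usize) -> usize {
    if (old_base..old_base + old_len).contains(&addr) {
        new_base + (addr - old_base)
    } else {
        addr
    }
}
```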

Works fine in Go.

Yes, you're not getting Rust performance (though a good part of that is Go's own compiler vs. using all the LLVM goodness), but performance is good enough and the benefits for developers are great. Having goroutines be so cheap means you don't even need to do anything explicitly async to get what you want.

Rust chose a different design space for their async implementation though, so what works well for Go wouldn't work well for Rust. In particular, the Rust devs wanted zero-cost FFI that external code doesn't need to know about, which precludes Go-like green threads.

Rust can be used in contexts like dynamic linkers, kernels, libc, microcontrollers, dynamic libraries, and all sorts of places Go has no business running, and it can use async in many of them. Go works fine for many contexts, but we already have languages like Go that work for those contexts. Rust is for the contexts they don't work well for. It's painful that it keeps being pushed to support things that would make it more difficult to support the areas it is unique in supporting.

Awaiting allows you to efficiently yield the thread to other tasks instead of blocking it. That's one of its biggest advantages.

When you block, the OS does the same thing - yields to other threads.

Yes, and it is extremely expensive. This is a well-known design problem in database engines.

The computational cost of context-switching threads at yield points is often many times higher than the actual workload executed between yield points. To address this you either need fewer yield points, which reduces concurrency, or you need to greatly reduce the cost of yielding. An async architecture reduces the cost of yielding by multiple orders of magnitude relative to threads.

> The computational cost of context-switching threads at yield points is often many times higher than the actual workload executed between yield points.

I would say this "often" is 1% of cases. As for the Rust ecosystem, it doesn't make much sense to add so much complexity and so many inconvenient abstractions to cover 1% of use cases.

It approaches 100% of cases for systems that care about software performance, since software performance is bandwidth-bound. If almost everyone agrees that software performance is optimally fast already, then I agree with you.

There is perfect performance, and there is performance that is good enough - which covers 99% of cases, where adding complexity is not justified.

And how much slower is that? What happens when I run a thousand async tasks? I'll give you a hint: with async/await, it has barely any overhead.
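
For a feel of the scale, a sketch assuming the tokio runtime (the comment names none):

```rust
// Spawning a thousand tasks costs bytes per task, not a stack per task.
#[tokio::main]
async fn main() {
    let handles: Vec<_> = (0..1_000u64)
        .map(|i| tokio::spawn(async move { i * 2 }))
        .collect();

    let mut sum = 0;
    for h in handles {
        sum += h.await.unwrap();
    }
    println!("sum = {sum}");
}
```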

The vast, vast majority of programmers are going to be writing software where there are only a handful of threads (if that). The "I need thousands of concurrent executions" case is simply not relevant to most people.

You do realize what servers do in parallel, right? Async/await allows ASP.NET to scale beyond one thread per request.

Are you going to put multiple customers’ data in the same OS process?

Did you know you can get even more performance if you manually manage memory and don’t use virtual functions?

> Now language runtimes prefer “green threads” for portability and performance

"Green threads" only exist in crappy interpreted languages, and only because they have stop-the-world single-threaded garbage collection.

Go and Java both have green threads, and are neither interpreted nor limited to single-threaded GC.

I’m just waiting for them to try co-operative multithreading again.

That's what async/await is, no? Yielding by awaiting is co-operative.

Proper modern languages offer both: you can keep your threads and reach for async only when it makes sense to.

Now, the languages that don't offer a choice are another matter.

That immediately falls apart if you want to attempt the extremely common pattern of wait-free usage of a main/UI thread.

You don’t have threads on embedded, but you want a way to express concurrent waiting. Different problems altogether.

You can, though. We used pthreads (well, a pthreads-compatible API) in production at massive scale on the ESP32-S3.

I think you are correct, insofar as N:M threading is often overkill for the problem at hand. However, some I/O-bound problems truly do require it. I haven't kept up with the details, but AFAIK the fallout from Spectre and Meltdown also means context switches are more expensive than they were historically, which is another downside of regular threads.

I also want to address something that I've seen in several sub-threads here: Rust's specific async implementation. The key limitation, compared to the likes of Go and JS, is that Rust attempts to implement async as a zero-cost abstraction, which is a much harder problem than what Go and JS do. Saying some variant of "Rust should just do the same thing as Go" is missing the point.

I think Rust didn’t need async at all.

The question then becomes what, if anything, should take its place, and what are the corresponding tradeoffs?

What is the kernel in this context?