The general rule is that if you need to wait faster, use async, and if you need to process faster, use threads.

Another way of thinking about this is whether you want to optimize your workload for throughput or latency. It's almost never a binary choice, though.
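To make "waiting faster" concrete, here's a toy sketch (mine, not part of the original point) using Python's stdlib asyncio: a hundred concurrent waits overlap, so total latency is roughly the longest individual wait, not the sum.

```python
import asyncio
import time

async def fake_io(i: int) -> int:
    # Stand-in for a slow external call; each "request" waits 0.1 s.
    await asyncio.sleep(0.1)
    return i

async def main() -> None:
    start = time.perf_counter()
    # 100 concurrent waits overlap: total time is ~0.1 s, not ~10 s.
    results = await asyncio.gather(*(fake_io(i) for i in range(100)))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} tasks in {elapsed:.2f}s")

asyncio.run(main())
```

Swap the sleep for real CPU work and the overlap disappears, which is the latency/throughput fork in a nutshell.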

Threads as they are conventionally considered are inadequate. Operating systems should offer something along the lines of scheduler activations[0]: a low-level mechanism that represents individual cores being scheduled/allocated to programs.

Async is responsive simply because it conforms to the (asynchronous) nature of hardware events. Similarly, threads are most performant if leveraged according to the usage of hardware cores. A program that spawns 100 threads on a system with 10 physical cores is just going to have threads interrupting each other for no reason; each core can only do so much work in a time frame, whether it's running 1 thread or 10.

The most performant/efficient abstraction is a state machine[1] per core. However, for some loss of performance and (arguable) ease of development, threads can be used on top of scheduler activations[2]. Async on top of threads is just the worst of both worlds. Think in terms of the hardware resources and events (memory accesses too), and the abstractions write themselves.

[0] https://en.wikipedia.org/wiki/Scheduler_activations, https://dl.acm.org/doi/10.1145/121132.121151 | Akin to thread-per-core

[1] Stackless coroutines and event-driven programming

[2] User-level virtual/green threads today, plus responsiveness to blocking I/O events

Haven't scheduler activations largely been abandoned in the BSD and Linux kernels?

Yes; my understanding is that, for kernels designed for 1:1 threading, scheduler activations are an invasive change and not preferred by developers. Presumably, an operating system designed around scheduler activations would be able to better integrate them into applications, possibly even binary-compatibly with existing applications expecting 1:1 threading.

Can you say more about what you mean by wait faster? Is it as in, enqueue many things faster?

Not the OP but I'll take a stab: I see "waiting faster" as meaning roughly "check the status of" faster.

For example, you have lots of concurrent tasks, and they're waiting on slow external IO. Each task needs its IO to finish before it can make forward progress. At any given time, it's unlikely more than a couple of tasks can make forward progress, because the rest are waiting on that IO. So most of the time, you end up checking on tasks that aren't ready to do anything, because the IO isn't done. You're waiting on them to be ready.

Now, if you can do that "waiting" (really, checking whether they're ready for work) faster, you can spend more of your machine time on whatever actual work _is_ ready to be done, rather than on checking which tasks are ready for work.
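This is exactly what readiness APIs like epoll/kqueue give you. A toy sketch with Python's stdlib `selectors` (my example, using socketpairs to fake the external IO): instead of asking every task "are you done yet?", one `select()` call reports exactly which ones became ready.

```python
import selectors
import socket

# Five "tasks", each waiting on its own socket for external IO.
sel = selectors.DefaultSelector()
pairs = [socket.socketpair() for _ in range(5)]
for i, (_, read_side) in enumerate(pairs):
    read_side.setblocking(False)
    sel.register(read_side, selectors.EVENT_READ, data=i)

# Pretend the IO for tasks 1 and 3 completes.
pairs[1][0].send(b"done")
pairs[3][0].send(b"done")

# One call tells us which tasks are ready, instead of polling all five.
ready = {key.data for key, _ in sel.select(timeout=0.1)}
print(ready)  # only the tasks whose IO finished

for w, r in pairs:
    w.close()
    r.close()
```

The cost of finding ready work scales with how much became ready, not with how many tasks exist, which is why "waiting" gets faster.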

Threads make sense in the opposite scenario: when you have lots of work that _is_ ready, and you just need to chew through it as fast as possible. E.g. numbers to crunch, data to search through, etc.
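The threaded counterpart, sketched with a stdlib thread pool (my example; note that in CPython pure-Python compute is GIL-bound, so this uses `hashlib`, which releases the GIL on large buffers, letting the threads genuinely overlap): all the work is ready up front, and the pool just chews through it.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Eight CPU-heavy chunks of work, all ready to go - nothing to wait on.
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

def digest(chunk: bytes) -> str:
    # hashlib releases the GIL for large buffers, so workers run in parallel.
    return hashlib.sha256(chunk).hexdigest()

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest, chunks))

print(len(digests))
```

There's nothing to "wait faster" on here; the bottleneck is raw compute, so more cores (via threads) is the lever that matters.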

I'd love if someone has a more illustrative metaphor to explain this, this is just how I think about it.