Where does the improved performance come from? The project does outline several factors, but I wonder which of them matter most, or whether they are all equally important.

And how big a factor is the busy-loop locking? Yes, the code tells the CPU to spin energy-efficiently, but it's not going to beat the OS if the loop is waiting for more work that isn't coming for a while. Is it doing this on every core?
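To make the trade-off concrete, here is a minimal sketch of a spin-wait that hints the CPU while polling and falls back to yielding to the OS after a while. This is not the project's actual code: `wait_for_work` and `spin_limit` are hypothetical names made up for illustration, and a real pool may never yield at all.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Waits until `flag` becomes true and returns how many times it had to poll.
/// `spin_limit` is a hypothetical knob: the number of cheap spins to try
/// before handing the core back to the OS scheduler on each further poll.
pub fn wait_for_work(flag: &AtomicBool, spin_limit: u32) -> u64 {
    let mut polls: u64 = 0;
    let mut spins: u32 = 0;
    while !flag.load(Ordering::Acquire) {
        polls += 1;
        if spins < spin_limit {
            // Tells the CPU this is a busy-wait (e.g. PAUSE on x86),
            // which is the "energy-efficient" part of spinning.
            hint::spin_loop();
            spins += 1;
        } else {
            // Work is not arriving soon, so stop burning the core.
            thread::yield_now();
        }
    }
    polls
}
```

With a large `spin_limit` this behaves like a pure busy loop (lowest wake-up latency, one core pinned per waiting thread); with a small one it leans on the scheduler instead.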

One factor could be that when a subprocess dies, it doesn't need to release any memory itself, since the OS reclaims everything in one go, whereas a thread teardown has to clean up neatly. Though I suppose that wouldn't be a lot of work.

Compared to Rayon or Taskflow, the biggest initial win is cutting out heap allocations for all the promise/result objects: those allocations behave like mutexes once many threads hammer the allocator at the same time.
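A rough sketch of the general idea, not the library's actual API: instead of each task returning a heap-allocated promise/result object, every worker writes straight into a slot of a buffer the caller already owns. `parallel_square` is a hypothetical name, and it uses plain `std::thread::scope` for brevity.

```rust
use std::thread;

/// Squares every element of `input` into the caller-owned `out` buffer.
/// No per-task result object is allocated: each worker gets a disjoint
/// mutable slice of `out` and writes its results in place.
/// Note: spawning OS threads here still allocates; a real pool would keep
/// its threads alive and only hand out slices like this.
pub fn parallel_square(input: &[u64], out: &mut [u64], threads: usize) {
    assert_eq!(input.len(), out.len());
    if input.is_empty() {
        return;
    }
    let threads = threads.max(1);
    // Ceiling division so every element lands in some chunk.
    let chunk = (input.len() + threads - 1) / threads;
    thread::scope(|s| {
        for (src, dst) in input.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (x, y) in src.iter().zip(dst.iter_mut()) {
                    *y = x * x;
                }
            });
        }
        // The scope joins all workers before returning.
    });
}
```

Because the slices are disjoint, the workers never touch shared state or the allocator while computing, which is the contention being avoided.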

Hard to rank the rest without a proper breakdown. If I ever tried, I’d probably end up writing a paper — and I’d rather write code :)