Would you recommend this as a "thread pool" / coroutine scheduler replacement for an application web server?

Yes, I was planning a similar experiment with UCall (https://github.com/unum-cloud/ucall), leveraging the NUMA functionality introduced in v2 of Fork Union. I don’t currently have the right hardware to test it properly, but it would be very interesting to measure how pinning behaves on machines with multiple NUMA nodes, NICs, and a balanced PCIe topology.

Tokio actually has some similarities with Rayon. Tokio is used in most Rust web servers, like Axum and Actix-web

That’s true — though in my benchmarks Tokio came out as one of the slower parallelism-enabling projects. The article still included a comparison:

  $ PARALLEL_REDUCTIONS_LENGTH=1536 cargo +nightly bench -- --output-format bencher
  
  test fork_union ... bench:  5,150 ns/iter (+/- 402)
  test rayon ... bench:      47,251 ns/iter (+/- 3,985)
  test smol ... bench:       54,931 ns/iter (+/- 10)
  test tokio ... bench:     240,707 ns/iter (+/- 921)
... but I now avoid comparing to Tokio since it doesn’t seem fair — fork-join style parallel processing isn’t really its primary use case.

That's outrageous.. and I don't agree with your assessment, because smol is in the same niche as Tokio (that is, an async execuutor, which isn't necessarily optimizing for CPU-bound workloads) and isn't nearly as slow.

I think performance is a very critical property for Rust infrastructure. One can only hope that newer Tokio versions could address overheads which make everyone slower than necessary.