> It's a little more than that though, using a pthread_mutex or even thread.park() on the slow path is less efficient than using a futex directly.
No, it absolutely isn’t.
The dominant cost of parking is whatever happens in the kernel and at the microarchitectural level when your thread goes to sleep. That cost is so dominant that whether you park with a futex wait or with a condition variable doesn’t matter at all.
(Source: I’ve done that experiment to death as a lock implementer back when I maintained Jikes RVM’s thin locks and then again when I wrote and maintained ParkingLot.)
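For concreteness, here is a minimal sketch of what parking through a mutex and condition variable can look like in C++. The `Parker` class and its member names are hypothetical and made up for illustration; this is not the actual ParkingLot or Jikes RVM code. The point is only that the slow path ends with the thread blocked in the kernel, same as it would with a direct futex wait.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical parker for illustration only; not ParkingLot's real code.
class Parker {
public:
    // Slow path: block the calling thread until unpark() has been called.
    void park() {
        std::unique_lock<std::mutex> guard(lock_);
        // The predicate handles spurious wakeups and the case where
        // unpark() ran before we started waiting.
        cond_.wait(guard, [this] { return notified_; });
        notified_ = false; // consume the wakeup so the parker is reusable
    }

    // Wake one parked thread, or let the next park() return immediately.
    void unpark() {
        {
            std::lock_guard<std::mutex> guard(lock_);
            notified_ = true;
        }
        cond_.notify_one();
    }

private:
    std::mutex lock_;
    std::condition_variable cond_;
    bool notified_ = false;
};
```

A lock's slow path would call `park()` after failing to acquire on the fast path, and the unlocking thread would call `unpark()` on the waiter it picks. The extra mutex acquire and release around the wait are the kind of overhead that gets lost in the noise next to the sleep itself.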