Your confusion is very natural:
> It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then.
This gets at why this felt like a big deal when we ran into this. This is how it would work with threads. Tasks and futures hook into our intuitive understanding of how things work with threads. (And for tasks, that's probably still a fair mental model, as far as I know.) But futures within a task are different because of the inversion of control: tasks must poll them for them to keep running. The problem here is that the task that's responsible for polling this future has essentially forgotten about it. The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
> tasks must poll them for them to keep running.
So async Rust introduces an entire novel class of subtle concurrent programming errors? Ugh, that's awful.
> The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
Yes. But I've never written code in a preemptible protected mode environment like Linux userspace where it is possible to make that mistake. That's nuts to me.
From my POV this seems like a fundamental design flaw in async rust. Like, on a bare metal thing I expect to deal with stuff like this... but code running on a real OS shouldn't have to.
I definitely hear that!
To keep it in perspective, though: we've been operating a pretty good size system that's heavy on async Rust for a few years now and this is the first we've seen this problem. Hitting it requires a bunch of things (programming patterns and runtime behavior) to come together. It's really unfortunate that there aren't guard rails here, but it's not like people are hitting this all over the place.
The thing is that the alternatives all have tradeoffs, too. With threaded systems, there's no distinction in code between stuff that's quick vs. stuff that can block, and that makes it easy to accidentally do time-consuming (blocking) work in contexts that don't expect it (e.g., a lock held). With channels / message passing / actors, having the receiver/actor go off and do something expensive is just as bad as doing something expensive with a lock held. There are environments that take this to the extreme where you can't even really block or do expensive things as an actor, but there the hidden problem is often queueing and backpressure (or lack thereof). There's just no free lunch.
I'd certainly think carefully in choosing between sync vs. async Rust. But we've had a lot fewer issues with both of these than I've had in my past experience working on threaded systems in C and Java and event-oriented systems in C and Node.js.
Rust can't assume you're running on a real OS though.