I have very little Rust experience... but I'm hung up on this:
> The lock is given to future1
> future1 cannot run (and therefore cannot drop the Mutex) until the task starts running it.
This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.
Why would do_async_thing() not immediately run the prints, return, and drop the lock after acquiring it? Why does future1 need to be "polled" for that to happen? I get that due to the select! behavior, the result of future1 is not consumed, but I don't understand how that prevents it from releasing the mutex.
It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then. Having to take some explicit second action to make that happen seems fundamentally broken to me...
EDIT: Rephrased for clarity.
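To make that concrete, here's a rough sketch of the behavior I mean, using plain OS threads and std's blocking Mutex (my own toy example, not anything from the article): releasing the lock is all it takes for the waiting thread to wake up and keep running, with no further action from anyone.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(Mutex::new(()));
    let guard = lock.lock().unwrap(); // main holds the lock

    let waiter = thread::spawn({
        let lock = Arc::clone(&lock);
        move || {
            // Blocks here until the lock is released...
            let _g = lock.lock().unwrap();
            // ...then runs immediately; nobody has to "poll" this thread.
            println!("acquired the lock and kept running");
        }
    });

    thread::sleep(Duration::from_millis(10));
    drop(guard); // releasing the lock is the only action needed
    waiter.join().unwrap();
}
```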
> This seems like a contradiction to me. How can future1 acquire the Mutex in the first place, if it cannot run? The word "given" is really odd to me.
`future1` did run for a bit, and it got far enough to acquire the mutex. (As the article mentioned, technically it took a position in a queue that means it will get the mutex, but that's morally the same thing here.) Then it was "paused". I put "paused" in scare quotes because it kind of makes futures sound like processes or threads, which have a "life of their own" until/unless something "interrupts" them, but an important part of this story is that Rust futures aren't really like that. When you get down to the details, they're more like a struct or a class that just sits there being data unless you call certain methods on it (repeatedly). That's what the `.await` keyword does for you, but when you use more interesting constructs like `select!`, you start to get more of the details in your face.
It's hard to be more concrete than that without getting into an overwhelming amount of detail. I wrote a set of blog posts that try to cover it without hand-waving the details away, but they're not short, and they do require some Rust background: https://jacko.io/async_intro.html
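As a tiny illustration of the "just sits there being data" point (a toy example, not code from the article): building a future runs none of its body; something has to poll it, which is what `.await` does under the hood.

```rust
#[tokio::main]
async fn main() {
    // Building the future runs none of its body.
    let fut = async {
        println!("now the body is actually running");
    };

    println!("the future exists, but it hasn't printed anything yet");

    // Awaiting is what polls it; only now does the body execute.
    fut.await;
}
```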
So my understanding was correct: it requires the programmer to deal with scheduling explicitly in userspace.
If I'm writing bare metal code for e.g. a little cortex M0, I can very much see the utility of this abstraction.
But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux. There should be some simpler intermediate abstraction... this seems like a case of forcing a too-complex interface on users who don't really require it.
There is one: tasks. But having the lower level (futures) available makes it very tempting to use it, both for performance and because the code is simpler (at least, it looks simpler). Some things that are easy with select! would be clunky with tasks.
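Here's a rough sketch of that difference, assuming tokio (toy names, not the article's code): a spawned task keeps getting polled by the runtime on its own, even if you stop awaiting its `JoinHandle`, whereas a bare future inside `select!` only makes progress while the `select!` keeps polling it.

```rust
use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    // A task has a life of its own: the runtime polls it whether or not
    // anyone is awaiting the JoinHandle.
    let handle = tokio::spawn(async {
        sleep(Duration::from_millis(50)).await;
        println!("task finished on its own");
    });

    // Racing against the handle and losing drops the handle,
    // but dropping a JoinHandle detaches the task rather than cancelling it.
    tokio::select! {
        _ = sleep(Duration::from_millis(10)) => println!("stopped waiting for the task"),
        _ = handle => println!("task finished first"),
    }

    // The task still completes in the background.
    sleep(Duration::from_millis(100)).await;
}
```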
On the other hand, some direct uses of futures are reminiscent of the tendency to obsess over ownership and borrowing to maximize sharing, when you could just use .clone() and it wouldn’t make any practical difference. Because Rust is so explicit, you can see the overhead so you want to minimize it.
To be clear, if you restrict yourself to `async`/`.await` syntax, you never see any of this. To await something means to poll it to completion, which is usually what you want. "Joining" two futures lets you poll both of them concurrently until they're both done, which is kind of the point of async as a concept, and this also doesn't really require you to think about scheduling. One place where things get hairy (like in this article) is "selecting" on futures, which polls them all until one of them is done, and then stops polling the rest. (Normally I'd loosely say it "drops the rest on the floor", but the deadlock in this article actually hinges on exactly what gets "dropped" when, in the Rust sense of the `Drop` trait.) This is where scheduling as you put it, or "cancellation" as Rust folks often put it, starts to become important. And that's why the article concludes "In the end, you should always be extremely careful with tokio::select!" However, `select!` is not the only construct that raises these issues. Speaking of which...
> But it seems like an absolutely absurd exercise for code running in userspace on a "real" OS like Linux
Clearly you have a point here, which is why these blog posts are making an impact. That said, one counterpoint is: have you ever wished you could kill a thread? The reason there are so many old Raymond Chen "How many times does it have to be said: Never call TerminateThread" blog posts is that lots of real-world applications really desperately want to call TerminateThread, and it's hard to persuade them to stop! The ability to e.g. put a timeout on any async function call is basically this same superpower, without corrupting your whole process (yay), but still with the unavoidable(?) difficulty of thinking about what happens when random functions give up halfway through.
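As a sketch of that superpower (assuming tokio; `fetch_something` is a made-up stand-in): `tokio::time::timeout` wraps any future and, if the deadline passes, drops it partway through, which is exactly the "give up halfway through" behavior above, scoped to one future instead of a whole thread.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical slow operation, just for illustration.
async fn fetch_something() -> String {
    sleep(Duration::from_secs(10)).await;
    "data".to_string()
}

#[tokio::main]
async fn main() {
    match timeout(Duration::from_millis(100), fetch_something()).await {
        Ok(data) => println!("got: {data}"),
        // On timeout, fetch_something() was dropped mid-flight; any cleanup
        // happens via the Drop impls of whatever it was holding at that point.
        Err(_elapsed) => println!("gave up after 100ms"),
    }
}
```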
Your confusion is very natural:
> It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then.
This gets at why it felt like such a big deal when we ran into this problem: that's exactly how it would work with threads. Tasks and futures hook into our intuitive understanding of how things work with threads. (And for tasks, that's probably still a fair mental model, as far as I know.) But futures within a task are different because of the inversion of control: tasks must poll them for them to keep running. The problem here is that the task responsible for polling this future has essentially forgotten about it. The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
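Here's a minimal sketch of the shape of the problem as I understand it (assuming tokio; `do_async_thing` follows the article's naming, but the article's actual code differs): after the other branch wins, `future1` is neither polled nor dropped, so the lock that the fair mutex hands to its queued request is never actually taken or released.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Mutex;
use tokio::time::sleep;

async fn do_async_thing(name: &str, lock: Arc<Mutex<()>>) {
    let _guard = lock.lock().await;
    println!("{name}: got the lock");
    // _guard is dropped here, releasing the lock (but only if we ever get this far).
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(()));
    let guard = lock.lock().await; // main holds the lock

    // future1 only makes progress when something polls it.
    let mut future1 = std::pin::pin!(do_async_thing("future1", lock.clone()));

    tokio::select! {
        // Polled once: do_async_thing joins the mutex's wait queue, then pends.
        _ = &mut future1 => {}
        _ = sleep(Duration::from_millis(10)) => {
            println!("other branch won; nothing polls future1 anymore");
        }
    }

    drop(guard);       // the fair mutex "gives" the lock to future1's queued request...
    lock.lock().await; // ...which is never polled again, so this waits forever (deadlock)
    println!("never reached");
}
```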
> tasks must poll them for them to keep running.
So async Rust introduces an entire novel class of subtle concurrent programming errors? Ugh, that's awful.
> The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
Yes. But I've never written code in a preemptible, protected-mode environment like Linux userspace where it's possible to make that mistake. That's nuts to me.
From my POV this seems like a fundamental design flaw in async rust. Like, on a bare metal thing I expect to deal with stuff like this... but code running on a real OS shouldn't have to.
I definitely hear that!
To keep it in perspective, though: we've been operating a pretty good-sized system that's heavy on async Rust for a few years now, and this is the first time we've seen this problem. Hitting it requires a bunch of things (programming patterns and runtime behavior) to come together. It's really unfortunate that there aren't guard rails here, but it's not like people are hitting this all over the place.
The thing is that the alternatives all have tradeoffs, too. With threaded systems, there's no distinction in code between stuff that's quick and stuff that can block, and that makes it easy to accidentally do time-consuming (blocking) work in contexts that don't expect it (e.g., with a lock held). With channels / message passing / actors, having the receiver/actor go off and do something expensive is just as bad as doing something expensive with a lock held. There are environments that take this to the extreme, where you can't even really block or do expensive things as an actor, but there the hidden problem is often queueing and backpressure (or lack thereof). There's just no free lunch.
I'd certainly think carefully in choosing between sync vs. async Rust. But we've had a lot fewer issues with both of these than I've had in my past experience working on threaded systems in C and Java and event-oriented systems in C and Node.js.
Rust can't assume you're running on a real OS, though.