If any rust designers are lurking about here: what made you decide to go for the async design pattern instead of the actor pattern, which - to me at least - seems so much cleaner and so much harder to get wrong?
Ever since I started using Erlang it felt like I finally found 'the right way', whereas before that I did a lot of work with sockets and asynchronous worker threads. Even though that usually worked as advertised, it had a large number of really nasty pitfalls which the actor model seemed to - effortlessly - sidestep.
So I'm seriously wondering what the motivation was. I get why JS uses async, there isn't any other way there, by the time they added async it was too late to change the fundamentals of the language to such a degree. But rust was a clean slate.
Not a Rust designer, but a big motivation for Rust's async design was wanting it to work on embedded, meaning no malloc and no threads. This unfortunately precludes the vast majority of the design space here, from active futures as seen in JS/C#/Go to the actor model.
You can write code using the actor model with Tokio. But it's not natural to do so.
Kind of a tangent, but I think "systems programming" tends to bounce back and forth between three(?) different concerns that turn out to be closely related:
1. embedded hardware, like you mentioned
2. high-performance stuff
3. "embedding" in the cross-language sense, with foreign function calls
Of course the "don't use a lot of resources" thing that makes Rust/C/C++ good for tiny hardware also tends to be helpful for performance on bigger iron. Similarly, the "don't assume much about your runtime" thing that's necessary for bare metal programming also helps a lot with interfacing with other languages. And "run on a GPU" is kind of all three of those things at once.
So yeah, which of those concerns was async Rust really designed around? All of them I guess? It's kind of like, once you put on the systems programming goggles for long enough, all of those things kind of blend together?
> So yeah, which of those concerns was async Rust really designed around? All of them I guess?
Yes, all of them. Futures needed to work on embedded platforms (so no allocation), needed to be highly optimizable (so no virtual dispatch), and need to act reasonably in the presence of code that crosses FFI boundaries (so no stack shenanigans). Once you come to terms with these constraints--and then add on Rust's other principles regarding guaranteed memory safety, references, and ownership--there's very little wiggle room for any alternative designs other than what Rust came up with. True linear types could still improve the situation, though.
> so no virtual dispatch
Speaking of which, I'm kind of surprised we landed on a Waker design that requires/hand-rolls virtual dispatch. Was there an alternate universe where every `poll()` function was generic on its Waker?
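For readers who haven't looked at the types: today the waker reaches `poll()` through `Context`, which wraps a `Waker` built from a hand-rolled `RawWakerVTable`. A rough sketch of the contrast with the hypothetical generic design (the `GenericWakerFuture`/`WakeByRef` names are made up):

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

// What std actually has: the waker arrives via `Context`, which wraps a
// `Waker` built from a manually written `RawWakerVTable` (dynamic dispatch).
trait StdLikeFuture {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

// The "alternate universe" being asked about: poll generic over the waker, so
// wake-ups could be statically dispatched. One trade-off: a generic method
// makes the trait non-dyn-compatible, so `Box<dyn Future>` gets awkward.
trait WakeByRef {
    fn wake_by_ref(&self);
}

trait GenericWakerFuture {
    type Output;
    fn poll<W: WakeByRef>(self: Pin<&mut Self>, waker: &W) -> Poll<Self::Output>;
}

fn main() {} // signatures only; nothing to run
```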
In my view, the major design sin was not _forcing_ failure into the outcome list.
`.await(DEADLINE)` (where the deadline is any non-zero duration, and 0 is 'reference defined' but still a real number) should have been the easy interface. Either it yields a value or it doesn't, and then the programmer has to expressly handle failure.
Deadline would only be the minimum duration after which the language, when evaluating the future / task, would return the empty set/result.
> Deadline would only be the minimum duration after which the language, when evaluating the future / task, would return the empty set/result.
This appears to be misunderstanding how futures work in Rust. The language doesn't evaluate futures or tasks. A future is just a struct with a poll method, sort of like how a closure in Rust is just a struct with a call method. The await keyword just inserts yield points into the state machine that the language generates for you. If you want to actually run a future, you need an executor. The executor could implement timeouts, but it's not something that the language could possibly have any way to enforce or require.
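To make that concrete, here's roughly how a deadline gets layered on today: it's just another future supplied by the runtime's library (Tokio's `tokio::time::timeout` in this sketch), not something the language itself enforces. The `fetch_value` function is a made-up stand-in for real async work.

```rust
use std::time::Duration;
use tokio::time::timeout;

// Hypothetical stand-in for some real async work.
async fn fetch_value() -> u32 {
    tokio::time::sleep(Duration::from_millis(200)).await;
    42
}

#[tokio::main]
async fn main() {
    // The deadline is just another future wrapping the original one; the
    // executor (Tokio here) drives both. The language never sees a "deadline".
    match timeout(Duration::from_secs(1), fetch_value()).await {
        Ok(v) => println!("got {v}"),
        Err(_elapsed) => println!("deadline hit; the caller handles the failure explicitly"),
    }
}
```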
Does that imply a lot of syscalls to get the monotonic clock value? Or is there another way to do that?
On Linux there is the VDSO, which on all mainstream architectures allows you to do `clock_gettime` without going through a syscall. It should take on the order of (double digit) nanoseconds.
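For reference, `std::time::Instant` goes through `clock_gettime(CLOCK_MONOTONIC)` on Linux, so a deadline check from Rust normally stays on the vDSO fast path. A trivial sketch:

```rust
use std::time::{Duration, Instant};

fn main() {
    // Instant::now() maps to clock_gettime(CLOCK_MONOTONIC), which the vDSO
    // serves without a syscall on mainstream Linux targets.
    let start = Instant::now();
    let deadline = start + Duration::from_millis(50);
    while Instant::now() < deadline {
        // ...do a small unit of work, then re-check the deadline...
    }
    println!("spent {:?} polling the clock", start.elapsed());
}
```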
If the scheduler is doing _any_ sort of accounting at all to work out even a rough kind of fairness balancing, then whatever resolution that accounting uses probably works.
At least for Linux, offhand, popular task scheduler frequencies used to be 100 and 1000hz.
Looks like the Kernel's tracking that for tasks:
https://www.kernel.org/doc/html/latest/scheduler/sched-desig...
"In CFS the virtual runtime is expressed and tracked via the per-task p->se.vruntime (nanosec-unit) value."
I imagine the .vruntime struct field is still maintained with the newer "EEVDF Scheduler".
...
A userspace task scheduler could similarly compare the DEADLINE against that runtime value. It would still reach that deadline after the minimum wait has passed, and thus be 'background GCed' at a time of the language's choosing.
The issue is that no scheduler manages futures. The scheduler sees tasks; futures are just a struct. See the discussion of embedded above: there is no "kernel-esque" parallel thread evaluating futures for you.
> But it's not natural to do so.
I tend to write most of my async Rust following the actor model and I find it natural. Alice Ryhl, a prominent Tokio contributor, has written about the specific patterns:
https://ryhl.io/blog/actors-with-tokio/
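For anyone who hasn't read the post, the shape is roughly this minimal sketch (the `Counter`/`Handle`/`Msg` names are mine, not from the post): the actor owns its state and runs a plain receive loop, and a cheap cloneable handle talks to it over a channel.

```rust
use tokio::sync::{mpsc, oneshot};

// The actor owns its state and runs a plain receive loop.
enum Msg {
    Increment,
    Get { reply: oneshot::Sender<u64> },
}

struct Counter {
    rx: mpsc::Receiver<Msg>,
    count: u64,
}

impl Counter {
    async fn run(mut self) {
        // Straight-line code: receive a message, update local state, reply.
        while let Some(msg) = self.rx.recv().await {
            match msg {
                Msg::Increment => self.count += 1,
                Msg::Get { reply } => {
                    let _ = reply.send(self.count);
                }
            }
        }
    }
}

// The handle is cheap to clone and is the only way the rest of the program
// talks to the actor.
#[derive(Clone)]
struct Handle {
    tx: mpsc::Sender<Msg>,
}

impl Handle {
    fn spawn() -> Handle {
        let (tx, rx) = mpsc::channel(32);
        tokio::spawn(Counter { rx, count: 0 }.run());
        Handle { tx }
    }

    async fn increment(&self) {
        let _ = self.tx.send(Msg::Increment).await;
    }

    async fn get(&self) -> u64 {
        let (reply, rx) = oneshot::channel();
        let _ = self.tx.send(Msg::Get { reply }).await;
        rx.await.unwrap_or(0)
    }
}

#[tokio::main]
async fn main() {
    let handle = Handle::spawn();
    handle.increment().await;
    println!("count = {}", handle.get().await);
}
```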
Oh I do too, and that's one of the recommendations in RFD 400 as well as in my talk. cargo-nextest's runner loop [1] is also structured as two main actors + one for each test. But you have to write it all out and it can get pretty verbose.
[1] https://nexte.st/docs/design/architecture/runner-loop/
The ‘Beware of cycles’ section at the end has some striking similarities with futurelock avoidance recommendations from the original article… not sure what to make of this except to say that this stuff is hard?
As a curious bystander, it will be interesting to see how the Zig async implementation pans out. They have the advantage of getting to see the pitfalls of those that have come before.
Getting back to Rust, even if not natural, I agree with the parent that the actor model is simply the better paradigm. Zero runtime allocation should still be possible, you just have to accept some constraints.
I think async looks simple because it looks like writing imperative code; unfortunately it is just obfuscating the complex reality underlying. The actor model makes things easier to reason about, even if it looks more complicated initially.
I think you can do a static list of actors or tasks in embedded, but it's hard to dynamically spin up new ones. That's where intra-task concurrency is helpful.
iiuc zig has thought about this specifically and there is a safe async-cancel in the new design that wasn't there in the old one.
I was wondering when someone would bring up Zig. I think it's fascinating how far it has come in the last couple of years and now the new IO interface/async implementation.
Question is - when will Zig become mature enough to become a legit choice next to say, Go or Rust?
I mean for a regular dev team, not necessarily someone who works deeply along with Andrew Kelley etc like Tigerbeetle.
Rust async still uses a native stack, which is just a form of memory allocator that uses LIFO order. And controlling stack usage in the embedded world is just as important as not relying on the system allocator.
So it's a pity that the Rust async design tried so hard to avoid any explicit allocations rather than using an explicit allocator that embedded code could use to preallocate and reuse objects.
> a native stack [is] just a form of memory allocator
There is a lot riding on that “just”. Hardware stacks are very, very unlike heap memory allocators in pretty much every possible way other than “both systems provide access to memory.”
Tons and tons of embedded code assumes the stack is, indeed, a hardware stack. It’s far from trivial to make that code “just use a dummy/static allocator with the same api as a heap”; that code may not be in Rust, and it’s ubiquitous for embedded code to not be written with abstractions in front of its allocator—why would it do otherwise, given that tons of embedded code was written for a specific compiler+hardware combination with a specific (and often automatic or compiler-assisted) stack memory management scheme? That’s a bit like complaining that a specific device driver doesn’t use a device-agnostic abstraction.
During the design phase of Rust async there was no async embedded code to draw inspiration from. For systems with a tight memory budget it is common to pre-allocate everything, often using custom bump allocation, or to split memory into a few regions for fixed-sized things and allocate from those.
And then the need for the runtime to poll futures means that async in Rust requires a non-trivial runtime, going against the desire to avoid abstractions in embedded code.
Async without polling, while stack-unfriendly, requires less runtime. And if Rust supported type-safe region-based allocators, where a bunch of things are allocated one by one and then released all at once, it could be a better fit for the embedded world.
Stack allocation/deallocation does not fragment memory, that’s a yuge difference for embedded systems and the main reason to avoid the heap
Even with the stack the memory can fragment. Just consider creating 10 futures on the stack where the last one created also completes last. Then the memory for the first 9 will not be released until that last one completes, because the stack frees in LIFO order.
This problem does not happen with a custom allocator where the things to allocate are roughly the same size and the allocator hands out same-sized cells.
Indeed, arena allocators are quite fast and allow you to really lock down the amount of memory that is in use for a particular kind of data. My own approach in the embedded world has always been to simply pre-allocate all of my data structures. If it boots it will run. Dynamic allocation of any kind is always going to have edge cases that will cause run-time issues. Much better to know that you have a deterministic system.
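A minimal sketch of the "same-sized cells, preallocated up front" idea (illustrative only; a real embedded pool would use a static buffer instead of `std` and add whatever interrupt safety the platform needs):

```rust
/// A fixed pool of same-sized cells, all preallocated up front.
struct Pool<T, const N: usize> {
    slots: [Option<T>; N],
}

impl<T, const N: usize> Pool<T, N> {
    fn new() -> Self {
        Self { slots: std::array::from_fn(|_| None) }
    }

    /// Put a value in the first free cell, or hand it back if the pool is
    /// exhausted -- there is deliberately no fallback allocation.
    fn insert(&mut self, value: T) -> Result<usize, T> {
        match self.slots.iter().position(|s| s.is_none()) {
            Some(i) => {
                self.slots[i] = Some(value);
                Ok(i)
            }
            None => Err(value),
        }
    }

    /// Free a cell. Every cell is the same size, so nothing fragments.
    fn remove(&mut self, index: usize) -> Option<T> {
        self.slots[index].take()
    }
}

fn main() {
    let mut pool: Pool<u32, 4> = Pool::new();
    let idx = pool.insert(7).expect("pool has room");
    assert_eq!(pool.remove(idx), Some(7));
}
```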
Why would the actor model require malloc and/or threads?
You basically have a concurrency-safe message queue. It would be pretty limited without malloc (fixed max queue size).
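Concretely, without malloc the mailbox ends up with a compile-time capacity and a `send` that can fail, something like this sketch (not concurrency-safe as written; a real one would use atomics or a critical section):

```rust
/// A minimal sketch of a no-malloc mailbox: capacity fixed at compile time,
/// so `send` must be able to fail when the queue is full.
struct Mailbox<T, const N: usize> {
    buf: [Option<T>; N],
    head: usize,
    len: usize,
}

impl<T, const N: usize> Mailbox<T, N> {
    fn new() -> Self {
        Self { buf: std::array::from_fn(|_| None), head: 0, len: 0 }
    }

    fn send(&mut self, msg: T) -> Result<(), T> {
        if self.len == N {
            return Err(msg); // full: the sender has to handle back-pressure itself
        }
        let tail = (self.head + self.len) % N;
        self.buf[tail] = Some(msg);
        self.len += 1;
        Ok(())
    }

    fn recv(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        let msg = self.buf[self.head].take();
        self.head = (self.head + 1) % N;
        self.len -= 1;
        msg
    }
}

fn main() {
    let mut mb: Mailbox<&str, 2> = Mailbox::new();
    mb.send("hello").unwrap();
    mb.send("world").unwrap();
    assert!(mb.send("overflow").is_err()); // back-pressure is explicit
    assert_eq!(mb.recv(), Some("hello"));
}
```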
Your application needs concurrency. So, the answer is... switch your entire application, its code style, and the libraries it uses into a separate domain that is borderline incompatible with the normal one? And has its own dialects that have their own compatibility barriers? Doesn't make sense to me.
It's easier to write all applications, concurrent or not, in a style that works well for concurrency. Lots of applications can benefit from concurrency.
You can do straight-line, single-threaded, non-concurrent code in an actor model. Mostly, that's what most of the actor code will look like: get a message, update local state in a straightforward way, send a response, repeat.
_an answer_ is performance - the necessity of creating copyable/copied messages for inter-actor communication everywhere in the program _can be_ expensive.
that said there are a lot of parts of a lot of programs where a fully inlined and shake optimized async state machine isn't so critical.
it's reasonable to want a mix, to use async which can be heavily compiler optimized for performance sensitive paths, and use higher level abstractions like actors, channels, single threaded tasks, etc for less sensitive areas.
I’m not sure this is actually true? Do messages have to be copied?
if you want your actors to be independent computation flows and they're in different coroutines or threads, then you need to arrange that the data source can not modify the data once it arrives at the destination, in order to be safe.
in a single threaded fully cooperative environment you could ensure this by implication of only one coroutine running at a time, removing data races, but retaining logical ones.
if you want to eradicate logical races, or have actual parallel computation, then the source data must be copied into the message, or the content of the message be wrapped in a lock or similar.
in almost all practical scenarios this means the data source copies data into messages.
Rust solves this at compile-time with move semantics, with no runtime overhead. This feature is arguably why Rust exists, it's really useful.
Rust moves are a memcpy where the source becomes effectively uninitialized after the move (that is to say, the compiler treats the source as uninitialized and rejects any further use of it). The copies are often optimized away by the compiler, but that isn't guaranteed.
This actually caused some issues with Rust in the kernel, because moving large structs could cause you to run out of the small amount of stack space available on kernel threads (they only get 8-16KB of stack compared to a typical 8MB for a userspace thread). The pinned-init crate is how they ended up solving this [1].
[1] https://crates.io/crates/pinned-init
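A small illustration of the memcpy point (the `Big` struct here is made up; whether the copy actually happens depends on the optimizer):

```rust
// `Big` is a made-up non-Copy type; the move below is a plain memcpy of the
// bytes (unless the optimizer elides it), and the compiler simply forbids
// touching the source afterwards.
struct Big {
    payload: [u8; 4096],
}

fn consume(b: Big) -> usize {
    b.payload.len()
}

fn main() {
    let b = Big { payload: [0; 4096] };
    let n = consume(b); // `b` is moved (memcpy'd) into the callee's frame
    // println!("{}", b.payload[0]); // error[E0382]: use of moved value: `b`
    println!("{n}");
}
```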
if you can always move the data that's the sweet spot for async, you just pass it down the stack and nothing matters.
all of the complexity comes in when more than one part of the code is interested in the state at the same time, which is what this thread is about.
Isn't that something Rust is particularly good at, controlling the mutation of shared memory?
yes
In Rust wouldn’t you just Send the data?
I'd recommend watching this video: https://www.infoq.com/presentations/rust-2019/; and reading this: https://tokio.rs/blog/2020-04-preemption
I'm not the right person to write a tl;dr, but here goes.
For actors, you're basically talking about green threads. Rust had a hard constraint that calls to C not have overhead and so green threads were out. C is going to expect an actual stack so you have to basically spin up a real stack from your green-thread stack, call the C function, then translate it back. I think Erlang also does some magic where it will move things to a separate thread pool so that the C FFI can block without blocking the rest of your Erlang actors.
Generally, async/await has lower overhead because it gets compiled down to a state machine and event loop. Languages like Go and Erlang are great, but Rust is a systems programming language looking for zero cost abstractions rather than just "it's fast."
To some extent, you can trade overhead for ease. Garbage collectors are easy, but they come with overhead compared to Rust's borrow checker method or malloc/free.
To an extent it's about tradeoffs and what you're trying to make. Erlang and Go were trying to build something different where different tradeoffs made sense.
EDIT: I'd also note that before Go introduced preemption, it too had "pitfalls". If a goroutine didn't trigger a stack reallocation (like function calls that would grow the stack) or do something that would yield (like blocking IO), it could starve other goroutines. Now Go does preemption checks so that the scheduler can interrupt hot loops. I think Erlang works somewhat similarly to Rust in scheduling in that its actors have a certain budget, every function call decrements that budget, and when they run out of budget they have to yield back to the scheduler.
Indeed, in Erlang the budget is counted in 'reductions'. Technically Erlang uses the BEAM as a CPU with some nifty extra features which allow you to pretend that you are pre-empting a process when in fact it is the interpreter of the bytecode that does the work and there are no interrupts involved. Erlang would not be able to do this if the Erlang input code was translated straight to machine instructions.
But Go does compile down to machine code, so that's why until it did pre-emption it needed that yield or hook.
Come to think of it: it is strange that such quota management isn't built into the CPU itself. It seems like a very logical thing to do. Instead we rely on hardware interrupts for pre-emption and those are pretty fickle. It also means that there is a fixed system wide granularity for scheduling.
Fickle? Pray tell, when the OS switch your thread for another thread, in what way does that fickleness show?
I take it you've never actually interfaced directly with hardware?
Interrupts are at the most basic level an electrical signal to the CPU to tell it to load a new address into the next instruction pointer after pushing the current one and possibly some other registers onto the stack. That means you don't actually know when they will happen and they are transparent to the point that those two instructions that you put right after one another are possibly detoured to do an unknown amount of work in some other place.
Any kind of side effect from that detour (time spent, changes made to the state of the machine) has the potential to screw up the previously deterministic path that you were on.
To make matters worse, there are interrupts that can interrupt the detour in turn. There are ways in which you can tell the CPU 'not now' and there are ways in which those can be overridden. If you are lucky you can uniquely identify the device that caused the interrupt to be triggered. But this isn't always the case and given the sensitivity of the inputs involved it isn't rare at all that your interrupt will trigger without any ground to do so. If that happens and the ISR is not written with that particular idea in mind you may end up with a system in an undefined state.
Interrupts are a very practical mechanism. But they're also a nightmare to deal with in the otherwise orderly affairs of computing and troubleshooting interrupt related issues can eat up days, weeks or even months if you are really unlucky.
I'm surprised to learn this too. I know the hobby embedded and HTTP-server OSS ecosystems have committed to async, but I didn't expect Oxide would.
We actually don't use Rust async in the embedded parts of our system. This is largely because our firmware is based on a multi-tasking microkernel operating system, Hubris[1], and we can express concurrency at the level of the OS scheduler. Although our service processors are single-core systems, we can still rely on the OS to schedule multiple threads of execution.
Rust async is, however, very useful in single-core embedded systems that don't have an operating system with preemptive multitasking, where one thread of execution is all you ever get. It's nice to have a way to express that you might be doing multiple things concurrently in an event-driven way without having to have an OS to manage preemptive multitasking.
[1] https://hubris.oxide.computer/
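A hosted-Rust illustration of that idea (embedded executors differ in the details, but the shape is the same: one thread, one executor, and futures interleaving at their `.await` points; the two async functions are made-up stand-ins):

```rust
use futures::executor::block_on;

// Made-up stand-ins for event-driven work; on a microcontroller these would
// await hardware events (a timer tick, a DMA-complete interrupt) instead.
async fn blink_led() {
    // toggle the LED, await the next timer tick, repeat...
}

async fn read_sensor() -> u32 {
    // await a conversion-complete event, then return the reading
    42
}

fn main() {
    // One thread and no OS scheduler: the executor interleaves the futures
    // at their await points.
    block_on(async {
        let ((), reading) = futures::join!(blink_led(), read_sensor());
        println!("sensor reading: {reading}");
    });
}
```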
Heh, this is super interesting to hear. Single-threaded async/concurrent code is so fun and interesting to see. I've run some Tokio programs in single-threaded mode just to see it in action.