Great read, and the example code makes sense. This stuff can be a nightmare to find, but once you do it's like a giant 1000 piece puzzle just clicks together instantly.

Indeed. One of the interesting side effects of being a remote company that records everything[0] is that we have the instant where the "1000 piece puzzle just clicks together" recorded, and it's honestly pretty wild. In this case, it was very much a shared brainstorming between four engineers (Eliza, Sean, John and Dave) -- and there is almost a passing of the baton where they start to imagine the kind of scenario that could induce this and then realize that those are exactly the conditions that exist in the software.

We are (on brand?) going to do a podcast episode on this on Monday[1]; ahead of that conversation I'm going to get a clip of that video out, just because it's interesting to see the team work together to debug it.

[0] https://rfd.shared.oxide.computer/rfd/0537

[1] https://discord.gg/QrcKGTTPrF?event=1433923627988029462

As a member of (Eliza, Sean, John, and Dave), I can second that debugging this was certainly an adventure. I'm not going to go as far as to say that we had fun, since...you can't have a heroic narrative without real struggle. But it was certainly rewarding to be in the room for that "a-ha!" moment, in which all the pieces really did begin to fit together very quickly. It was like the climax of a detective story --- and it was particularly well-scripted the way each of us contributed a little piece of the puzzle.

Since you are of of the people working directly on this codebase, may I ask you why is select! being used/allowed in the first place?

Its footgun-y nature has been known for years (IIRC even the first version of the tokio documentation warned against that) and as such I don't really understand why people are still using it. (For context I was the lead of a Rust team working on a pretty complex async networking program and we had banned select! very early in the project and never regretted this decision once).

What to use instead?

Your favorite llm will give you several good options. You can use an mpsc channel where you have a task per sender which waits on each branch in the select and then sends a message and then just wait on the receiver end of that channel. Or you could use https://docs.rs/futures/latest/futures/future/fn.select.html (or the select_all version). Both make it more obvious what is going on. The last way would be to implement a future manually. This would probably be my favorite option in many cases as it would be least overhead, but if you've never implemented a future it may be a bit intimidating.

Edit: tokio docs actually provide several suggestions: https://docs.rs/tokio/latest/tokio/macro.select.html#alterna...