This started as a throwaway metaphor in a blog post, but it's now fully runnable: a toy RTOS with preemptive multitasking inside Super Mario Bros. on the NES.

Essentially, this is:

- A rudimentary preemptive RTOS

- Using an unmodified NES emulator (FCEUX) as the CPU

    - "Unmodified" depending on how you define terms
- With emulator save states as the thread contexts

- With support for (very basic) mutexes, interrupt masking, and condition variables

- Demonstrated using Super Mario Bros. 1-1 with sections of the map dedicated to various synchronization primitives

There are many simplifications and shortcuts taken (it doesn't even have task priorities), and it doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice), but I think it's A) pretty interesting and B) a uniquely visceral explanation of threads.

That's pretty neat.

The metaphor I usually go for when it comes to threads is having several widgets that need to be assembled. To begin working on Widget B, you have to pause the work on Widget A. Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person. Multicore is having multiple workbenches each with their own toolbox so they can operate completely in parallel.

A mutex is like having a specialized tool that can only be used by one widget assembly at a time. Like, each workbench might have a full set of screwdrivers, hammers, wrenches, sockets, etc., but you only have 1 welder.

A semaphore is a tool that can be used by a limited number of widgets, like an oven that can fit up to 4 widgets.
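In code terms, the welder and the oven map pretty directly onto a lock and a counting semaphore. A minimal sketch using Python's threading module (the widget names are just for illustration):

```python
import threading
import time

welder = threading.Lock()        # one welder: one assembly at a time
oven = threading.Semaphore(4)    # one oven: fits up to 4 widgets at once

results = []

def assemble(widget_id):
    with welder:                 # exclusive access, like the single welder
        results.append(f"welded {widget_id}")
    with oven:                   # up to 4 threads may hold this concurrently
        time.sleep(0.01)         # "baking"
        results.append(f"baked {widget_id}")

threads = [threading.Thread(target=assemble, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 16: each of the 8 widgets welded once and baked once
```

With 8 widgets and a 4-slot oven, the second batch of 4 has to wait at the oven door, which is exactly the semaphore behavior.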

I think this is a pretty damn good metaphor for the introduction of these concepts, especially in a context where the learner(s) might need to imminently apply this knowledge.

What drove me to write this post (as someone who didn't attend college where I might've learned exactly what I'm about to say -- big grain of salt) is that it took me being forced by circumstance to become a firmware engineer for a time to actually understand not just how to multithread well, but what the hell this all even was.

And I guess I saw one too many Reddit comments (mistake #1) which downplayed the idea that there's any benefit to knowing the answer to such things.

That all said, this blog post / metaphor is definitely aimed at someone who's already been using threads effectively for some time -- I don't think it'd fare very well as a first introduction.

As a special challenge: explain (Software) Transactional Memory, as e.g. implemented in Haskell, in your metaphor.

STM is pretty similar to how some databases implement transactions.
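A toy sketch of that similarity (illustrative Python, not Haskell's actual STM machinery): each transaction reads versioned variables, buffers its writes locally, and commits atomically only if nothing it read changed underneath it, retrying otherwise — i.e., optimistic concurrency, like a database:

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # serializes commits, not the work itself

def atomically(transaction):
    """Run transaction(read, write) until it commits cleanly."""
    while True:
        reads = {}   # TVar -> version observed
        writes = {}  # TVar -> buffered new value
        def read(tv):
            if tv in writes:
                return writes[tv]
            reads[tv] = tv.version
            return tv.value
        def write(tv, value):
            writes[tv] = value
        result = transaction(read, write)
        with _commit_lock:
            # Validate: did anything we read change while we worked?
            if all(tv.version == v for tv, v in reads.items()):
                for tv, value in writes.items():
                    tv.value = value
                    tv.version += 1
                return result
        # Conflict: someone committed first. Retry from scratch.

# Example: a transfer between two "accounts".
a, b = TVar(100), TVar(0)

def transfer(read, write):
    write(a, read(a) - 10)
    write(b, read(b) + 10)

atomically(transfer)
print(a.value, b.value)  # 90 10
```

In the workbench metaphor: you photocopy the instruction sheet, do all your work on the copy, and only paste your changes back onto the original if nobody else touched it in the meantime — otherwise you throw the copy away and start over.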


SMP is a second processor, even more redundant than a second core. Its own cache, memory, Northbridge access, etc.

Yeah, I messed up SMT vs SMP. Can't edit it now, though.

Ah well -- the article itself has its own fair share of accuracy issues -- and I learned something from your conversation, which wouldn't have happened without said mistake.

> Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person.

This sounds like hyperthreading – or, maybe I'm missing the screwdriver in the metaphor :)

Hyperthreading is Intel's implementation of SMP

I think you mean SMT (simultaneous multithreading), as opposed to SMP (symmetric multiprocessing)?

Ah hell, that was MY mistake. It should be SMT in my original comment and it's way too late to edit it now.

Funny thing is, when I was first typing that comment, I started with "Hyperthreading" and deleted it in favor of SMP to be vendor agnostic.

EDIT: And it should have been SMT anyways!

This is a super cool visual demonstration of RTOS/scheduling! I love the region-based critical sections!

I took a real-time operating systems course in university as an elective. One of the hardest courses I took the whole four years, but also one of the most interesting. Had a great professor, who gave really demanding, but very instructive, project-based assignments.

I need to find a toy project to play around with this domain again.

Thank you!

I'm curious how effective you feel this specific example might've been if it were delivered during your course. I suspect I've stumbled across a really helpful teaching tool, but having not gone to university, I don't actually know how this stuff is being taught :v

Nor I, but I've learned from the metaphor, for what that's worth. And the demystification of primitives has considerable value these days.

> Nor I, but I've learned from the metaphor, for what that's worth.

It's the sickest possible outcome for me, so: worth a lot!

This is simply amazing work, I take my hat off to you. And not just the threading in an emulator (which is a genius concept all by itself) but for the article itself - I have spent years trying to find the right way to articulate exactly your points in hopes of engaging people to learn more about the systems they’re building on top of.

From now on, I’ll simply send out a link to this. Thank you!

My person!

I worried that I'd come across as a high-horse dickhead in the mystification section, when that's the fundamental opposite of my intention and attitude here. I'm glad to hear that it resonated with another.

Although I guess we could both be high-horse dickheads.

IMO the mystification section is right on the money, I couldn’t have said it better myself (and believe me, I’ve tried).

The mystification section was my favorite part - you make some excellent points there.

Not only that but now, thanks to you, we can all be high-horse dickheads concurrently.

Sorry to be a pedant.

From your previous post

> The first thing to understand about threads is that processors do not support them.

Yes, but I would say that by changing the emulator (augmenting the emulator software with your other software, sure) in the outside world, rather than doing something in the inside world, you are implementing something more akin to hardware concurrency support, or to processes.

> It doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice)

Right, but I think if we're going to make a threads vs process distinction (vs concurrency in general) whether we have only a small opt-in set of communication channels / synchronization primitives (the mutex and barrier), or a wide open one (shared memory, be careful!) is the operative distinction.

Truly appreciate the thoughtful response.

First off: you're absolutely correct, and in ways which pain me to admit, because I do want to create correct analogies, as opposed to ones which mislead or half-inform.

I racked my brain, during the writing of this post, to figure out how to resolve these inconsistencies without sacrificing the goal of the post, which was basically to "unlock" an understanding of:

"The things you think are magic are actually just well-designed systems which you can, and should, reason about."

I think most programmers who use threads but don't know what they are have likely primarily had threads explained to them via analogies which teach their effects, not their nature. While this makes a lot of sense, and this instruction is certainly necessary and very valuable, it also has the unfortunate effect of mystifying and deifying such systems.

I also think most programmers would have no trouble accepting, or perhaps stating, that a thread is a data structure. But the nature of that data structure -- what goes inside it? -- has, in my experience, not been well understood.

And I truly think the core thing one has to do to demystify threads is explain what a thread holds -- a snapshot of CPU state -- because it links together the two most important things:

1. What would a thread structure even contain?

2. What features of a processor would even enable threads to function?

And with that as my primary motivation, I will defend the usefulness of this analogy. While the exact "state" being swapped is very different (which certainly demands explanation in any rigorous instruction), the fact that "it's simply state being swapped around, by code you can read, at an interval combined with conditions" is what I truly believe is the missing keystone -- and I do think this way of demonstrating it is useful to that degree.
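To put it in the fewest possible lines (a toy sketch, not the project's actual code; a real context structure holds actual hardware registers, and the switch runs inside a timer interrupt handler):

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """The data structure a thread actually is: saved CPU state."""
    pc: int = 0                       # program counter
    sp: int = 0                       # stack pointer
    registers: dict = field(default_factory=dict)

# The "CPU": exactly one live set of state at any moment.
cpu = ThreadContext(pc=0x8000, sp=0x01FF, registers={"A": 0, "X": 0, "Y": 0})

# Three threads, each with its own stack region.
threads = [ThreadContext(pc=0x8000, sp=0x01FF - 64 * i) for i in range(3)]
current = 0

def context_switch():
    """What a preemptive scheduler does on every timer tick: save the live
    CPU state into the outgoing thread's structure, then load the incoming
    thread's saved state back into the CPU."""
    global current
    threads[current].pc = cpu.pc
    threads[current].sp = cpu.sp
    threads[current].registers = dict(cpu.registers)
    current = (current + 1) % len(threads)
    cpu.pc = threads[current].pc
    cpu.sp = threads[current].sp
    cpu.registers = dict(threads[current].registers)

context_switch()
print(current, hex(cpu.sp))  # now "running" thread 1, on its own stack
```

That's the whole trick: state being swapped around, by code you can read, at an interval.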

But I am planning a part 3 which goes into fine detail on all of the disanalogies in part 2.

(Finally, I think my biggest sin of this whole thing is using the term "RTOS" in the title. That was stupid. This is not an RTOS at all. I didn't even use the term anywhere in the blog post itself, so why use it in the HN title? I regret this.)

Last night I read an article about eliminating lag in emulators. It's done with a similar concept. Basically, for each frame, the emulator calculates the state for different button combinations. Then, based on the button you actually push, the state shown moves to the precalculated one.

Not 100% familiar with exactly what this is but I am familiar with run-ahead, which is basically also the same idea as GGPO, but for eliminating lag:

- Run the emulator one frame normally, using real polled input. Somehow snapshot the state.

- Then, run the emulator n frames (usually just one) with the same input. Present the video + audio from the last frame.

- Synchronize. (Some emulators can get very fancy; instead of just waiting for vsync, they'll also delay until the end of the window minus estimated processing time, to poll input at the last possible moment.)

- Roll back to the saved snapshot. (I believe you can also optimize if you know the inputs really didn't change, avoiding a lot of rollbacks at the cost of less predictable frame times.)

The main reason this is even a good idea is because most games will have some of their own processing latency by design, so jumping a frame or two ahead usually doesn't have any noticeable side-effects. This is a pretty cool idea since obviously modern computers with LCD screens have a lot more latency basically everywhere versus older simpler machines connected to CRTs.
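Those steps, in sketch form (illustrative Python; `FakeCore` and its save/load/run methods are stand-ins for whatever a real emulator core exposes, not any particular emulator's API):

```python
RUNAHEAD_FRAMES = 1

class FakeCore:
    """Stand-in for a real emulator core; the methods are assumptions."""
    def __init__(self):
        self.frame_count = 0
    def run_frame(self, inputs, render):
        self.frame_count += 1
        return f"frame {self.frame_count}"
    def save_state(self):
        return self.frame_count
    def load_state(self, snapshot):
        self.frame_count = snapshot

presented = []

def runahead_frame(core, inputs):
    # 1. Run one frame normally with the real polled input, then snapshot.
    core.run_frame(inputs, render=False)
    snapshot = core.save_state()
    # 2. Run n more frames (usually 1) with the same input; present only
    #    the last one, so the player sees video/audio from the "future".
    for _ in range(RUNAHEAD_FRAMES):
        frame = core.run_frame(inputs, render=True)
    presented.append(frame)
    # 3. Roll back so the next real frame continues the true timeline.
    core.load_state(snapshot)

core = FakeCore()
runahead_frame(core, inputs={"A": True})
print(presented, core.frame_count)  # presents frame 2, state is back at frame 1
```

The cost is obvious from the sketch: every displayed frame runs the core 1+n times plus a save and a load, which is why the state has to be small and fast to restore.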

Unfortunately, this sort of approach only works when your emulator's state is small enough and fast enough to restore.

I actually have been dying to experiment with designing an emulator to have fast incremental snapshots from the ground up to see if you could manage to make this feasible for more modern consoles. You could, for example, track dirty memory pages with userfaultfd/MEM_WRITE_WATCH, and design structures like JIT caches to be able to handle rewinding without having to drop the entire cache. I'm actually not sure that all emulators clear their caches upon loading state, but you know, more generally, I would like to know how fast and small you could get save states to be if you were designing for that from the ground up.

What do you mean by JIT cache in this case -- like an inline data cache, or like a cache for the jitted binary code?

The latter, though I'm just using it as an example really. I think for most emulator core designs today you wind up dumping a lot of ephemeral state when you rewind or load state (and sometimes even saving state has some overhead due to synchronization.)

I think this is kind of the inverse of how some emulators that add remote play deal with network lag... they will predict what input the opponent will give for the current frame and display it, but then when they actually receive the real opponent input some number of frames later, they will go back to that frame and re-calculate based on knowing the actual input for that frame, and then catch up to the current frame. That way the game isn't waiting for remote inputs to refresh the screen, but can then reconcile once all information is known.
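A toy sketch of that predict-then-reconcile loop (illustrative Python; real rollback netcode like GGPO snapshots full game state, this just uses a number, and the "predict by repeating the last known input" rule is one common heuristic):

```python
class RollbackSession:
    def __init__(self):
        self.state = 0
        self.frame = 0
        self.snapshots = {0: 0}     # frame -> saved state
        self.remote_inputs = {}     # frame -> confirmed remote input
        self.predicted = {}         # frame -> input we guessed

    def step(self, state, local, remote):
        return state + local + remote   # stand-in for the game simulation

    def advance(self, local_input):
        """Advance one frame now, guessing the remote input."""
        guess = self.remote_inputs.get(self.frame - 1, 0)
        self.predicted[self.frame] = guess
        self.state = self.step(self.state, local_input, guess)
        self.frame += 1
        self.snapshots[self.frame] = self.state

    def confirm(self, frame, real_remote, local_inputs):
        """The real remote input for a past frame finally arrived."""
        self.remote_inputs[frame] = real_remote
        if self.predicted.get(frame) == real_remote:
            return                  # prediction was right; nothing to redo
        # Mispredicted: roll back to that frame and re-simulate forward,
        # still guessing (with the new info) for frames not yet confirmed.
        self.state = self.snapshots[frame]
        for f in range(frame, self.frame):
            remote = self.remote_inputs.get(f, real_remote)
            self.state = self.step(self.state, local_inputs[f], remote)
            self.snapshots[f + 1] = self.state

session = RollbackSession()
session.advance(local_input=1)      # frame 0: guesses remote = 0
session.advance(local_input=1)      # frame 1: still guessing 0
session.confirm(0, real_remote=5, local_inputs={0: 1, 1: 1})
print(session.state)  # 12: both frames re-simulated with remote input 5
```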

Interesting; which emulator(s) was this for? I have to imagine this strategy is mostly effective below a certain total complexity level -- however you want to define that -- and the more modern you get, the more you're lucky to even render a frame at all.

Sounds a lot like speculative execution for hardware.

Was just thinking the same, this is effectively branch prediction/precalculation.

Umm, that's brilliant. Is this technique employed in all modern emulators or a recent thing?

Isn't this more like coroutines / green threads? I guess it doesn't matter, as they are essentially the same concept apart from cooperative scheduling vs. preemptive scheduling.

I suppose the actual thing it is is closer to what you're describing, but the thing that it makes viscerally understandable is threads itself -- I hope.

This is probably a stupid question, but why real-time? Isn't it enough to "just" make an OS in SMB, doesn't adding real-time guarantees make it even more difficult, by restricting what allocator and scheduling algorithms you can use?

In any case, this looks like a really cool project.

I appreciate your humility in assuming that the term is correct, and your understanding must be off.

As a matter of fact, the truth is that this is not an RTOS, and I'm embarrassed that I put that term in the title. It wasn't intentionally malicious -- in the moment I felt it fit -- but it was clouded by the bias of wanting to create an interesting HN title.

Tsoding recently built coroutines for C from scratch. The concept of coroutines is related to threads, so I bet you’d like the video if you liked the article.

https://youtu.be/sYSP_elDdZw
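For a quick feel of the cooperative version, Python generators can stand in for the C coroutines in the video (just a sketch; each `yield` is the explicit handoff that a preemptive scheduler would instead force on you from a timer interrupt):

```python
from collections import deque

def counter(name, n, log):
    """A coroutine: runs until it *chooses* to yield control."""
    for i in range(n):
        log.append(f"{name}:{i}")
        yield        # cooperative yield point, like a coroutine_yield() in C

def run(coroutines):
    """A minimal round-robin cooperative scheduler."""
    ready = deque(coroutines)
    while ready:
        co = ready.popleft()
        try:
            next(co)             # resume until its next yield
            ready.append(co)     # still alive: back of the queue
        except StopIteration:
            pass                 # finished: drop it

log = []
run([counter("a", 2, log), counter("b", 2, log)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1']
```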

Would be neat if the subworld memory was shared between the save states, to really drive home the mutex concept.

This is one of the core weaknesses of the analogy that I'm trying to figure out how to resolve: there's no shared memory, unlike actual threads.

I think the solution requires that the game code is what's executing threading syscalls, not the Lua around the game code.

It'd definitely be a super sick evolution to add true shared in-game memory.

I guess in a way the three Mario threads are more like processes? And they're using OS-level or shared-memory mutexes and condvars that can be shared between processes?

I see what you're getting at -- though then it raises the question: what would it mean for those processes to have multiple threads within them?

The way I've been looking at it, if I want to take this further, processes would maybe be _NES ROMs_, and each process can have multiple threads. Swapping between processes would inherently be more expensive than between threads within a process, which I think would be a really instructive aspect. Also, the whole idea of "sharing memory" between entirely different games is obviously ridiculous, which could be yet another teaching tool.

Then if we want to expand it even further, maybe multicore support would look like multiple emulator instances, connected via sockets to support cross-core IPC. You'd probably want to institute an "all the threads of a process have to be on the same core" rule (so they can locally share primitives in the same Lua script), which would be a great way to demonstrate the principle of building systems which adapt to their realistic constraints, as opposed to building systems which exactly model a textbook.

I've gotten ahead of myself, though.

Easier to leave out the parallelism and just look at concurrency.

For a game like Mario, you can split up the memory and decide what you want to be shared and what you want to be per-thread. E.g., starting small, you can rig up just a few fixed variables like scores or whatever to be persisted from one game to the next. Trying to push that further without causing endless corruption should be fun :)
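A sketch of that split (illustrative Python; the addresses and region boundary are made up, not SMB's actual memory map): on each context switch, save and restore only the per-thread region, and leave the shared region alone:

```python
RAM_SIZE = 2048
SHARED_END = 256   # bytes 0..255 are shared (e.g. scores); the rest is per-thread

ram = bytearray(RAM_SIZE)                        # the "live" machine RAM
saved = [bytearray(RAM_SIZE) for _ in range(3)]  # per-thread snapshots
current = 0

def switch_to(next_thread):
    """Swap only the per-thread slice; the shared slice persists."""
    global current
    saved[current][SHARED_END:] = ram[SHARED_END:]      # save outgoing
    ram[SHARED_END:] = saved[next_thread][SHARED_END:]  # restore incoming
    current = next_thread

ram[10] = 99      # shared: survives the switch
ram[1000] = 42    # per-thread: swapped out with thread 0
switch_to(1)
print(ram[10], ram[1000])  # 99 0
```

The corruption fun starts the moment two threads both write the shared slice without a mutex, which is, of course, exactly the lesson.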

You have a critical section in the bonus section behind the pipe. Maybe shared state protected by that critical section could be your shared memory, like the state of the coins there.

Brilliant visual demonstration, really helps build an intuition for the concepts at play.


This is a fantastic way of demonstrating threading and scheduling, well done!

Kind of pushing the definition of RTOS here, I was expecting something like using ACE to run a concurrent program, still an interesting and informative post.

You're goddamn right it's pushing it.

I struggled with how to convey this; the ultimate goal is to viscerally demonstrate what a thread IS, so I'm comfortable with it being... something strange... but I do want it to resemble actual concurrent systems to a useful degree.

Like, the thread scheduler code is userland code -- similar to, say, FreeRTOS -- but the "threads" themselves aren't exactly threads in an exact sense, since their context packages are entire save states for a console, not just its processor.

Also, the game code (the true userland?) has no ability to access the threading API whatsoever, which really harms the analogy.

So I went with RTOS because it's a preemptive scheduler with synchronous reschedule-on-block, where the scheduler is implemented in the same codebase as the code using it. But to be honest, nothing I know of fits.

I really enjoyed this, thank you.

Wow! So cool!

"...But first we need to talk about parallel universes."

If you're going to make this your DooM thing, clearly the next thing to do is grab a DooM engine and get to work. (PsyDoom has Lua support even).

But congrats. This is worth being included in a uni course.

It'd be really interesting to push this to that level -- the big challenge likely being that DooM's state (including video memory) is certainly much larger and more expensive to swap than an NES save state.

One would have to get really clever...

… and introduce something like virtual memory?

