I tried optimizing my CPU emulator dispatch in raw assembly to see if I could run a simple fibonacci program faster than C++. And I was not even close. In the end I merged it and made it a default-disabled dispatch option, because ... there has to be a way to make it faster!

If you are daring, you can find my puny attempt here: https://github.com/libriscv/libriscv/blob/master/lib/librisc...

I did manage to improve it once I figured out some of the various modes of accessing memory, and I even managed to cut the jump table entries down from 64-bit to 32-bit, which should help keep it in memory. I made the jump table part of .text in order to make it RIP-relative. For the fibonacci sequence program, not many bytecodes are needed. I would greatly appreciate some tips on what can be improved there.
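For anyone who hasn't seen the 32-bit table trick, here is a rough C++ sketch of the idea, not the actual libriscv code. It assumes the GNU "labels as values" extension (GCC or Clang), and the three opcodes and the `run` function are made up for illustration. Each table entry is a 32-bit distance from a base handler instead of a 64-bit pointer. Where the compiler places the table is up to the compiler; putting it in .text is something I did by hand in the assembly version.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical three-opcode bytecode, for illustration only.
enum Op : std::uint8_t { OP_INC = 0, OP_DBL = 1, OP_HALT = 2, OP_COUNT = 3 };

// Requires the GNU "labels as values" extension (GCC or Clang).
std::int64_t run(const std::uint8_t* code) {
    // 32-bit distance of a handler from the base handler (do_inc).
#define OFFSET_OF(label) \
    static_cast<std::int32_t>(static_cast<char*>(&&label) - \
                              static_cast<char*>(&&do_inc))

    // Half the size of a table of 64-bit handler addresses.
    static const std::int32_t offsets[OP_COUNT] = {
        OFFSET_OF(do_inc),
        OFFSET_OF(do_dbl),
        OFFSET_OF(do_halt),
    };

    // Fetch the next opcode, then jump to base + 32-bit offset.
#define DISPATCH()                                                     \
    do {                                                               \
        const std::uint8_t op = *code++;                               \
        assert(op < OP_COUNT);                                         \
        goto *static_cast<void*>(static_cast<char*>(&&do_inc) +        \
                                 offsets[op]);                         \
    } while (0)

    std::int64_t acc = 0;
    DISPATCH();

do_inc:
    acc += 1;
    DISPATCH();
do_dbl:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;

#undef DISPATCH
#undef OFFSET_OF
}
```

Calling `run` on the program `{OP_INC, OP_INC, OP_DBL, OP_INC, OP_HALT}` walks the handlers through the offset table and returns the accumulator.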

> I made the jump table part of .text in order to make it RIP-relative

Did you do this manually?

GCC changed to always put jump tables in .rodata, which causes problems when .rodata is stored in ROM.

It does have the `-fno-jump-tables` option but that just disables jump tables rather than allowing you to control where they go.

Have you tried actually comparing what you have to what the C++ compiler generates? I don't have a lot of context here, but I think it's possible that the difference is less due to the dispatch mechanism (i.e. how the next instruction is fetched) and more due to the implementations of the instructions themselves.

One opportunity for optimization is mapping emulated registers to real x86-64 registers and basically never spilling them to memory (that way an add doesn't have to fetch its operands from memory first, it can just add directly). Though that makes writing the emulator a lot more annoying.
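To make that concrete, here is a minimal C++ sketch of the same idea one level up from assembly. The `Cpu` struct, the choice of guest registers x10 to x12, and both function names are assumptions for illustration, not taken from libriscv. The first version goes through the in-memory register file for every emulated instruction. The second copies the hot guest registers into locals, which the compiler is free to keep in host registers, and writes them back once at the end. An optimizing compiler may well do this on its own for a loop this simple; in a real dispatch loop with opaque jumps it usually can't.

```cpp
#include <cstdint>

// Hypothetical guest register file that lives in memory.
struct Cpu {
    std::uint64_t reg[32];
};

// Baseline: every emulated instruction reads and writes cpu.reg[] in memory.
std::uint64_t fib_spilled(Cpu& cpu, unsigned n) {
    cpu.reg[10] = 0;
    cpu.reg[11] = 1;
    for (unsigned i = 0; i < n; ++i) {
        cpu.reg[12] = cpu.reg[10] + cpu.reg[11];  // add x12, x10, x11
        cpu.reg[10] = cpu.reg[11];                // mv  x10, x11
        cpu.reg[11] = cpu.reg[12];                // mv  x11, x12
    }
    return cpu.reg[10];
}

// Pinned: hot guest registers live in locals for the whole loop and are
// written back to the register file only once, on exit.
std::uint64_t fib_pinned(Cpu& cpu, unsigned n) {
    std::uint64_t x10 = 0;
    std::uint64_t x11 = 1;
    std::uint64_t x12 = cpu.reg[12];
    for (unsigned i = 0; i < n; ++i) {
        x12 = x10 + x11;  // add x12, x10, x11
        x10 = x11;        // mv  x10, x11
        x11 = x12;        // mv  x11, x12
    }
    cpu.reg[10] = x10;    // write-back keeps the in-memory state consistent
    cpu.reg[11] = x11;
    cpu.reg[12] = x12;
    return x10;
}
```

Both functions leave the same values in the register file; the annoying part in a real emulator is that every exit path (traps, system calls, debugging) needs that write-back.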