have you tried actually comparing what you have to what the c++ compiler generates? i don't have a lot of context here, but i think it's possible that the difference is due less to the dispatch mechanism (i.e. how the next instruction is fetched) and more to the implementations of the instructions themselves.
one opportunity for optimization is mapping emulated registers to real x86-64 registers and basically never spilling them to memory (that way, if you have to add, you don't have to fetch first and then add; you just add directly). that does make writing the emulator a lot more annoying, though.
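
to make the register idea concrete, here's a rough sketch in c++. everything in it is made up for illustration (the toy `Op` set, `Insn`, `Cpu`, and both `run_*` functions are hypothetical, not your emulator). the first loop keeps the emulated registers in a struct in memory; the second copies them into locals for the whole loop, which gives the compiler a good chance to keep them in real registers. c++ can't force that, and the optimizer may already do it for the first version too, so checking the generated assembly is still the only way to know.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction set, for illustration only.
enum Op : std::uint8_t { ADD_A_B, DEC_B, JNZ_B, HALT };

struct Insn {
    Op op;
    std::int32_t target;  // jump target, only used by JNZ_B
};

// Emulated register file: two registers, a and b.
struct Cpu {
    std::uint64_t a = 0;
    std::uint64_t b = 0;
};

// Variant 1: emulated registers are accessed through the Cpu struct,
// so each instruction may load from and store to memory.
void run_in_memory(Cpu& cpu, const std::vector<Insn>& prog) {
    std::size_t pc = 0;
    for (;;) {
        const Insn& i = prog[pc++];
        switch (i.op) {
            case ADD_A_B: cpu.a += cpu.b; break;
            case DEC_B:   cpu.b -= 1; break;
            case JNZ_B:   if (cpu.b != 0) pc = static_cast<std::size_t>(i.target); break;
            case HALT:    return;
        }
    }
}

// Variant 2: emulated registers are copied into locals for the whole loop
// and written back once on exit. Locals whose address is never taken are
// easy for the compiler to keep in host registers.
void run_in_locals(Cpu& cpu, const std::vector<Insn>& prog) {
    std::uint64_t a = cpu.a;
    std::uint64_t b = cpu.b;
    std::size_t pc = 0;
    for (;;) {
        const Insn& i = prog[pc++];
        switch (i.op) {
            case ADD_A_B: a += b; break;
            case DEC_B:   b -= 1; break;
            case JNZ_B:   if (b != 0) pc = static_cast<std::size_t>(i.target); break;
            case HALT:
                cpu.a = a;  // write back only when leaving the loop
                cpu.b = b;
                return;
        }
    }
}

// Toy program: a += b; b -= 1; loop while b != 0. Expects b >= 1 on entry.
std::vector<Insn> sum_program() {
    return {{ADD_A_B, 0}, {DEC_B, 0}, {JNZ_B, 0}, {HALT, 0}};
}

// Runs both variants from the same starting state and checks they agree.
std::uint64_t check_variants_agree(std::uint64_t n) {
    Cpu c1;
    c1.b = n;
    Cpu c2;
    c2.b = n;
    run_in_memory(c1, sum_program());
    run_in_locals(c2, sum_program());
    assert(c1.a == c2.a && c1.b == c2.b);
    return c1.a;
}
```

the annoying part shows up as soon as an instruction handler lives in a separate function: it either has to take the registers as arguments or you have to write them back to `Cpu` before the call, which is basically the spill you were trying to avoid.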