> A Haskell analysis showed that integer workloads (the most common in normal computing) saw some 4.8w out of 22.1w dedicated to the decoder.

You are mixing up two studies: the Haskell paper only looked at total core power (and how it varied across algorithms). It was this study [1] that provided the 4.8w out of 22.1w figure.

And while it's an interesting number, it says nothing about the overhead of decoding x86. They chose to label that component "instruction decoder", but what they actually measured was the whole process of fetching instructions from anywhere other than the μop cache.

That 4.8w number includes the branch predictor, the L1 instruction cache, potentially TLB misses, and instruction fetch from L2/L3. And depending on the details of their regression analysis, it might also include things after instruction decoding, like register renaming and move elimination. Given that they show the L1 data cache (a structure of comparable size) as using 3.8w, it's entirely possible that the L1 instruction cache alone makes up most of the 4.8w number.
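
To make the attribution problem concrete, here is a minimal sketch (my own construction, not the paper's code) of the kind of counter-based linear regression the paper appears to rely on: fit measured power against performance-counter rates, then read each coefficient as that component's power. The counter names and all numbers below are invented; the point is that when two event rates are nearly collinear, or one of them is omitted, the regression freely shuffles power between them:

```python
import numpy as np

# Hypothetical setup: regress measured power on per-cycle counter rates.
# All rates and costs here are made up for illustration.
rng = np.random.default_rng(0)
n = 2000

decoded = rng.uniform(0, 4, n)              # insts decoded / cycle
renamed = decoded + rng.normal(0, 0.01, n)  # μops renamed / cycle
# renamed tracks decoded almost exactly, since nearly every decoded
# instruction is also renamed -- the two columns are close to collinear.
l1i_miss = rng.uniform(0, 0.2, n)           # L1I misses / cycle

# "Ground truth" per-event costs (arbitrary units) plus measurement noise.
power = 1.0 * decoded + 1.5 * renamed + 20.0 * l1i_miss + rng.normal(0, 0.2, n)

# Fit with both front-end counters included.
full = np.column_stack([decoded, renamed, l1i_miss])
coef_full, *_ = np.linalg.lstsq(full, power, rcond=None)

# Fit the same data *without* the rename counter, as if "decoder"
# were the only front-end regressor in the model.
no_rename = np.column_stack([decoded, l1i_miss])
coef_nr, *_ = np.linalg.lstsq(no_rename, power, rcond=None)

print("with rename term:   ", coef_full)  # decoded/renamed split is noisy
print("without rename term:", coef_nr)    # "decoder" absorbs rename power (~2.5)
```

When the rename counter is dropped, the "decoder" coefficient soaks up the rename power; when both are included, the split between them is dominated by noise. Either way, reading that coefficient as "power used by the instruction decoder" is a modeling choice, not a direct measurement.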

So we really can't use this paper to say anything about the overhead of decoding x86, because that's not what it's measuring.

[1] https://www.usenix.org/system/files/conference/cooldc16/cool...